Text++ Archives

Only a few more weeks to wait until Cambria Hebert’s “Text” will launch on November 8, 2013. Until then, you’ll have to be satisfied with a peek at this gorgeous cover, designed by Mae I Design. This is a NA contemporary novel that is part of Cambria’s “Take It Off” series. Due to graphic language and sexual content this book is recommended for readers 18+.

Here’s the synopsis:

One text can change everything.

Honor Calhoun never thought her life would ever be like the books she writes for a living. One morning while out for a run, she learns not all bad things are plots in novels. Some horrors can actually come true.

She faces off with a persistent attacker, holds her own, but in the end is taken hostage and thrown into a hole. In the middle of the woods.

But Honor didn’t go down there alone.

She took her kidnapper’s phone with her. With a spotty signal and a dying battery, her hope is slim.

Nathan Reed is an active duty Marine stationed at a small reserve base in Pennsylvania. All he wants is a calm and uneventful duty station where he can forget the memories of his time in a war-torn country.

But a single text changes everything.

Nathan becomes Honor’s only hope for survival, and he has to go against the clock, push aside his past, and take on a mission for a girl he’s never met.

Both of them want freedom… but they have to survive long enough to obtain it.

Add “Text” to your Goodreads TBR list!

A Regular Expression Matcher

Code by Rob Pike

Exegesis by Brian Kernighan

Draft version Jan 28 2007

Introduction

Beautiful code is likely to be simple -- clear and easy to understand. Beautiful code is likely to be compact -- just enough code to do the job and no more -- but not cryptic, to the point where it cannot be understood. Beautiful code may well be general, solving a broad class of problems in a uniform way. One might even describe it as elegant, showing good taste and refinement.

In this chapter I will describe a piece of beautiful code for matching regular expressions that meets all of these criteria.

Regular expressions are a notation for describing patterns of text, in effect a special-purpose language for pattern matching. Although there are many variants, all share the idea that most characters in a pattern match literal occurrences of themselves but some 'metacharacters' have special meaning, for example '*' to indicate some kind of repetition or [...] to mean any one character from the set within the brackets.

In practice, most searches in programs like text editors are for literal words, so the regular expressions are often literal strings like print, which will match printf or sprint or printer paper anywhere. In so-called 'wild cards' in filenames in Unix and Windows, a * matches any number of characters, so the pattern *.c matches all filenames that end in .c. There are many, many variants of regular expressions, even in contexts where one would expect them to be the same. Jeffrey Friedl's Mastering Regular Expressions (O'Reilly, 2006) is an exhaustive study of the topic.

Regular expressions were invented by Stephen Kleene in the mid-1950's as a notation for finite automata, and in fact they are equivalent to finite automata in what they represent. Regular expressions first appeared in a program setting in Ken Thompson's version of the QED text editor in the mid-1960's. In 1967, Ken applied for a patent on a mechanism for rapid text matching based on regular expressions; it was granted in 1971, one of the very first software patents [US Patent 3,568,156, Text Matching Algorithm, March 2, 1971].

Regular expressions moved from QED to the Unix editor ed, and then to the quintessential Unix tool, grep, which Ken created by performing radical surgery on ed. Through these widely used programs, regular expressions became familiar throughout the early Unix community.

Ken's original matcher was very fast because it combined two independent ideas. One was to generate machine instructions on the fly during matching so that it ran at machine speed, not by interpretation. The other was to carry forward all possible matches at each stage, so it did not have to backtrack to look for alternative potential matches. Matching code in later text editors that Ken wrote, like ed, used a simpler algorithm that backtracked when necessary. In theory this is slower but the patterns found in practice rarely involved backtracking, so the ed and grep algorithm and code were good enough for most purposes.

Subsequent regular expression matchers like egrep and fgrep added richer classes of regular expressions, and focused on fast execution no matter what the pattern. Ever fancier regular expressions became popular, and were included not only in C-based libraries but also as part of the syntax of scripting languages like Awk and Perl.

The Practice of Programming

In 1998, Rob Pike and I were writing The Practice of Programming ('TPOP'). The last chapter of the book, 'Notation', collected a number of examples where a good notation led to better programs and better programming. This included the use of simple data specifications (printf formats, for instance), and the generation of code from tables.

Given our Unix backgrounds and many years of experience with tools based on regular expression notation, we naturally wanted to include a discussion of regular expressions, and it seemed mandatory to include an implementation as well. Given our emphasis on tools, it also seemed best to focus on the class of regular expressions found in grep, rather than say those from shell wild cards, since we could also talk about the design of grep itself.

The problem was that any existing regular expression package was far too big. The local grep was over 500 lines long (about 10 book pages). Open-source regular expression packages tended to be huge, roughly the size of the entire book, because they were engineered for generality, flexibility, and speed; none was remotely suitable for pedagogy.

I suggested to Rob that we needed to find the smallest regular expression package that would illustrate the basic ideas while still recognizing a useful and non-trivial class of patterns. Ideally, the code would fit on a single page.

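Rob disappeared into his office, and at least as I remember it now, appeared again in no more than an hour or two with the 30 lines of C code that subsequently appeared in Chapter 9 of TPOP. That code implements a regular expression matcher that handles these constructs:

    c    matches any literal character c
    .    matches any single character
    ^    matches the beginning of the input string
    $    matches the end of the input string
    *    matches zero or more occurrences of the previous character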

This is quite a useful class; in my own experience of using regular expressions on a day-to-day basis, it easily accounts for 95 percent of all instances. In many situations, solving the right problem is a big step on the road to a beautiful program. Rob deserves great credit for choosing so wisely, from among a wide set of options, a very small yet important, well-defined and extensible set of features.

Rob's implementation itself is a superb example of beautiful code: compact, elegant, efficient, and useful. It's one of the best examples of recursion that I have ever seen, and it shows the power of C pointers. Although at the time we were most interested in conveying the important role of a good notation in making a program easier to use and perhaps easier to write as well, the regular expression code has also been an excellent way to illustrate algorithms, data structures, testing, performance enhancement, and other important topics.

Implementation

In the book, the regular expression matcher is part of a program that mimics grep, but the regular expression code is completely separable from its surroundings. The main program is not interesting here -- it simply reads from its standard input or from a sequence of files, and prints those lines that contain a match of the regular expression, as does the original grep and many other Unix tools.

This is the matching code:
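The listing itself is missing from this copy of the text. The version below is a reconstruction of the three functions described in the paragraphs that follow (match, matchhere and matchstar); it should agree with the code published in Chapter 9 of TPOP, but treat it as a sketch to be checked against the book rather than as the authoritative listing. The forward declarations are an addition so that the fragment compiles on its own.

```c
/* Forward declarations (added here so the fragment is self-contained). */
int matchhere(char *regexp, char *text);
int matchstar(int c, char *regexp, char *text);

/* match: search for regexp anywhere in text */
int match(char *regexp, char *text)
{
    if (regexp[0] == '^')
        return matchhere(regexp+1, text);
    do {    /* must look even if string is empty */
        if (matchhere(regexp, text))
            return 1;
    } while (*text++ != '\0');
    return 0;
}

/* matchhere: search for regexp at beginning of text */
int matchhere(char *regexp, char *text)
{
    if (regexp[0] == '\0')
        return 1;
    if (regexp[1] == '*')
        return matchstar(regexp[0], regexp+2, text);
    if (regexp[0] == '$' && regexp[1] == '\0')
        return *text == '\0';
    if (*text != '\0' && (regexp[0] == '.' || regexp[0] == *text))
        return matchhere(regexp+1, text+1);
    return 0;
}

/* matchstar: search for c*regexp at beginning of text */
int matchstar(int c, char *regexp, char *text)
{
    do {    /* a * matches zero or more instances */
        if (matchhere(regexp, text))
            return 1;
    } while (*text != '\0' && (*text++ == c || c == '.'));
    return 0;
}
```

A call such as match("^xyz", "xyzabc") returns 1, while match("^xyz", "abcxyz") returns 0 because the anchor pins the match to the start of the text.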

The function match(regexp, text) tests whether there is an occurrence of the regular expression anywhere within the text; it returns 1 if a match is found and 0 if not. If there is more than one match, it finds the leftmost and shortest.

The basic operation of match is straightforward. If the first character of the regular expression is ^ (an anchored match), any possible match must occur at the beginning of the string. That is, if the regular expression is ^xyz, it matches xyz only if xyz occurs at the beginning of the text, not somewhere in the middle. This is tested by matching the rest of the regular expression against the text starting at the beginning, and nowhere else.

Otherwise, the regular expression might match anywhere within the string. This is tested by matching the pattern against each character position of the text in turn. If there are multiple matches, only the first (leftmost) one will be identified. That is, if the regular expression is xyz, it will match the first occurrence of xyz regardless of where it occurs.

Notice that advancing over the input string is done with a do-while loop, a comparatively unusual construct in C programs. The occurrence of a do-while instead of a while should always raise a question: why isn't the loop termination condition being tested at the beginning of the loop, before it's too late, rather than at the end after something has been done? But the test is correct here: since the * operator permits zero-length matches, we first have to check whether a null match is possible.

The bulk of the work is done in the function matchhere(regexp, text), which tests whether the regular expression matches the text that begins right here. The function matchhere operates by attempting to match the first character of the regular expression with the first character of the text. If the match fails, there can be no match at this text position and matchhere returns 0. If the match succeeds, however, it's possible to advance to the next character of the regular expression and the next character of the text. This is done by calling matchhere recursively.

The situation is a bit more complicated because of some special cases, and of course the need to stop the recursion. The easiest case is that if the regular expression is at its end (regexp[0] == '\0'), then all previous tests have succeeded, and thus the regular expression matches the text.

If the regular expression is a character followed by a *, matchstar is called to see whether the closure matches. The function matchstar(c, regexp, text) tries to match repetitions of the text character c, beginning with zero repetitions and counting up, until it either finds a match of the rest of the text, or it fails and thus concludes that there is no match. This identifies a 'shortest match', which is fine for simple pattern matching as in grep, where all that matters is to find a match as quickly as possible. A 'longest match' is more intuitive and almost certain to be better for a text editor where the matched text will be replaced. Most modern regular expression libraries provide both alternatives, and TPOP presents a simple variant of matchstar for this case, shown later.

If the regular expression is a $ at the end of the expression, then the text matches only if it too is at its end.

Otherwise, if we are not at the end of the text string (that is, *text != '\0') and if the first character of the text string matches the first character of the regular expression, so far so good; we go on to test whether the next character of the regular expression matches the next character of the text by making a recursive call to matchhere. This recursive call is the heart of the algorithm and is why the code is so compact and clean.

If all of these attempts to match fail, then there can be no match at this point in the regular expression and the text, so matchhere returns 0.

This code really uses C pointers. At each stage of the recursion, if something matches, the recursive call that follows uses pointer arithmetic (e.g., regexp+1 and text+1) so that the next function is called with the next character of the regular expression and of the text. The depth of recursion is no more than the length of the pattern, which in normal use is quite short, so there is no danger of running out of space.

Alternatives

This is a very elegant and well-written piece of code, but it's not perfect. What might one do differently? I might rearrange matchhere to deal with $ before *. Although it makes no difference here, it feels a bit more natural, and a good rule is to do easy cases before hard ones.

In general, however, the order of tests is critical. For instance, in this test from matchstar:

    } while (*text != '\0' && (*text++ == c || c == '.'));

no matter what, we must advance over one more character of the text string, so the increment in text++ must always be performed.

This code is careful about termination conditions. Generally, the success of a match is determined by whether the regular expression runs out at the same time as the text does. If they do run out together, that indicates a match; if one runs out before the other, there is no match. This is perhaps most obvious in a line like

    if (regexp[0] == '$' && regexp[1] == '\0')
        return *text == '\0';

but subtle termination conditions show up in other places as well.

The version of matchstar that implements leftmost longest matching begins by identifying a maximal sequence of occurrences of the input character c. Then it uses matchhere to try to extend the match to the rest of the pattern and the rest of the text. Each failure reduces the number of c's by one and tries again, including the case of zero occurrences.

Consider the regular expression (.*), which matches arbitrary text within parentheses. Given a target text with nested parentheses, a longest match from the beginning will identify the entire parenthesized expression, while a shortest match will stop at the first right parenthesis. (Of course a longest match beginning from the second left parenthesis will extend to the end of the text.)
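To make the difference between the two closure strategies concrete, here is a small sketch that reports how much text each one consumes. It is not the listing from TPOP: the names here, star_shortest, star_longest and match_length are hypothetical, and the functions return a pointer just past the matched text (or NULL) instead of 0 or 1 so that the extent of the match is visible. The target string "(a(b)c)d" is likewise an assumption chosen for illustration.

```c
#include <stddef.h>

/* here: match re at the start of text; returns a pointer just past the
   matched text, or NULL. `longest` selects the closure strategy. */
static const char *here(const char *re, const char *text, int longest);

/* star_shortest: try zero repetitions of c first, then one, two, ... */
static const char *star_shortest(int c, const char *re, const char *text)
{
    const char *end;

    do {
        if ((end = here(re, text, 0)) != NULL)
            return end;
    } while (*text != '\0' && (*text++ == c || c == '.'));
    return NULL;
}

/* star_longest: consume a maximal run of c, then back off one at a time */
static const char *star_longest(int c, const char *re, const char *text)
{
    const char *t, *end;

    for (t = text; *t != '\0' && (*t == c || c == '.'); t++)
        ;
    for (;;) {
        if ((end = here(re, t, 1)) != NULL)
            return end;
        if (t == text)      /* zero occurrences already tried */
            break;
        t--;
    }
    return NULL;
}

static const char *here(const char *re, const char *text, int longest)
{
    if (re[0] == '\0')
        return text;
    if (re[1] == '*')
        return longest ? star_longest(re[0], re+2, text)
                       : star_shortest(re[0], re+2, text);
    if (re[0] == '$' && re[1] == '\0')
        return *text == '\0' ? text : NULL;
    if (*text != '\0' && (re[0] == '.' || re[0] == *text))
        return here(re+1, text+1, longest);
    return NULL;
}

/* match_length: length of the match of re at the start of text, or -1 */
long match_length(const char *re, const char *text, int longest)
{
    const char *end = here(re, text, longest);

    return end == NULL ? -1 : (long)(end - text);
}
```

With the pattern (.*) and the text "(a(b)c)d", the shortest strategy matches the 5 characters "(a(b)" and the longest strategy matches the 7 characters "(a(b)c)". The explicit t == text check in star_longest is a deliberate choice in this sketch: it stops the back-off loop without ever stepping the pointer below the start of the text.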

Building On It

The purpose of TPOP was to teach good programming. At the time the book was written, Rob and I were still at Bell Labs, so we did not have first-hand experience of how the book would be best used in a classroom. It has been gratifying to discover that some of the material does work well in classes. I have used this code since 2000 as a vehicle for teaching important points about programming.

First, it shows how recursion is useful and leads to clean code, in a new setting; it's not yet another version of Quicksort (or factorial!), nor is it some kind of tree walk.

It's also a good example for performance experiments. Its performance is not very different from the system versions of grep, which shows that the recursive technique is not too costly and that it's not worth trying to tune the code.

On the other hand, it is also a fine illustration of the importance of a good algorithm. If a pattern includes several .*s, the straightforward implementation requires a lot of backtracking, and in some cases will run very slowly indeed. (The standard Unix grep has the same properties.) For example, a grep command with such a pattern takes about 20 seconds to process a 4 MB text file on a typical machine. An implementation based on converting a non-deterministic finite automaton to a deterministic automaton, as in egrep, will have much better performance on hard cases -- the same pattern and the same input is processed in less than a tenth of a second, and in general, the running time is independent of the pattern.
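One way to see the backtracking cost directly is to count how many times matchhere is entered. The sketch below instruments a copy of the matcher with a counter; the counter, the helper count_calls, and the test pattern .*.*.*b run against a text of n copies of the letter a are all assumptions made for illustration, not part of the original code or of the timing experiment described above.

```c
/* Hypothetical instrumentation: number of times matchhere was entered. */
static unsigned long calls;

static int matchhere(const char *regexp, const char *text);

/* matchstar: shortest-match closure, as in the basic matcher */
static int matchstar(int c, const char *regexp, const char *text)
{
    do {
        if (matchhere(regexp, text))
            return 1;
    } while (*text != '\0' && (*text++ == c || c == '.'));
    return 0;
}

/* matchhere: same logic as the basic matcher, plus the counter */
static int matchhere(const char *regexp, const char *text)
{
    calls++;
    if (regexp[0] == '\0')
        return 1;
    if (regexp[1] == '*')
        return matchstar(regexp[0], regexp+2, text);
    if (regexp[0] == '$' && regexp[1] == '\0')
        return *text == '\0';
    if (*text != '\0' && (regexp[0] == '.' || regexp[0] == *text))
        return matchhere(regexp+1, text+1);
    return 0;
}

static int match(const char *regexp, const char *text)
{
    if (regexp[0] == '^')
        return matchhere(regexp+1, text);
    do {
        if (matchhere(regexp, text))
            return 1;
    } while (*text++ != '\0');
    return 0;
}

/* count_calls: matchhere calls needed to run regexp against a text made
   of n copies of 'a' (n is capped at 255 by the fixed buffer) */
unsigned long count_calls(const char *regexp, int n)
{
    char text[256];
    int i;

    for (i = 0; i < n && i < 255; i++)
        text[i] = 'a';
    text[i] = '\0';
    calls = 0;
    match(regexp, text);
    return calls;
}
```

Because the text contains no b, the pattern .*.*.*b can never match, so every combination of closure lengths is tried at every starting position. Doubling n from 20 to 40 multiplies the call count by much more than two, whereas a literal pattern such as b costs one call per starting position.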

Extensions to the regular expression class can form the basis of a variety of assignments. For example:

(1) Add other metacharacters, like + for one or more occurrences of the previous character, or ? for zero or one matches. Add some way to quote metacharacters, like \$ to stand for a literal occurrence of $.

(2) Separate regular expression processing into a 'compilation' phase and an 'execution' phase. Compilation converts the regular expression into an internal form that makes the matching code simpler or such that subsequent matching runs faster. This separation is not necessary for the simple class of regular expressions in the original design, but it makes sense in grep-like applications where the class is richer and the same regular expression is used for a large number of input lines.

(3) Add character classes like [abc] and [0-9], which in conventional grep notation match a or b or c and a digit respectively. This can be done in several ways, the most natural of which seems to be replacing the char*'s of the original code with a structure, and modifying the basic code to handle an array of these instead of an array of characters. It's not strictly necessary to separate compilation from execution for this situation, but it turns out to be a lot easier. Students who follow the advice to pre-compile into such a structure invariably do better than those who try to interpret some complicated pattern data structure on the fly.

Writing clear and unambiguous specifications for character classes is tough, and implementing them perfectly is worse, requiring a lot of tedious and uninstructive coding. I have simplified this assignment over time, and today most often ask for Perl-like shorthands such as \d for digit and \D for non-digit instead of the original bracketed ranges.

(4) Use an opaque type to hide the RE structure and all the implementation details. This is a good way to show object-oriented programming in C, which doesn't support much beyond this. In effect, one makes a regular expression class but with function names like RE_new() and RE_match() for the methods instead of the syntactic sugar of an object-oriented language.

(5) Modify the class of regular expressions to be like the wild cards in various shells: matches are implicitly anchored at both ends, * matches any number of characters, and ? matches any single character. One can modify the algorithm or map the input into the existing algorithm.

(6) Convert the code to Java. The original code uses C pointers very well, and it's good practice to figure out the alternatives in a different language. Java versions use either String.charAt (indexing instead of pointers) or String.substring (closer to the pointer version). Neither seems as clear as the C code, and neither is as compact. Although performance isn't really part of this exercise, it is interesting to see that the Java implementation runs roughly six or seven times slower than the C versions.

(7) Write a wrapper class that converts from regular expressions of this class to Java's Pattern and Matcher classes, which separate the compilation and matching in a quite different way. This is a good example of the Adapter or Facade pattern, which puts a different face on an existing class or set of functions.
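As an illustration of the first extension in the list, the following sketch adds +, ? and backslash quoting to the basic matcher. It is one possible student-style solution under stated assumptions, not code from TPOP: the helper matchone and the extra quoted parameter are hypothetical names, a backslash is assumed to make the following character literal, and closures keep the shortest-match behaviour of the original.

```c
static int matchhere(const char *re, const char *text);

/* matchone: does pattern character c match the character at text?
   A quoted '.' is treated as a literal dot, not a wildcard. */
static int matchone(int c, int quoted, const char *text)
{
    return *text != '\0' && ((!quoted && c == '.') || c == *text);
}

/* matchstar: shortest match of zero or more c, then the rest of re */
static int matchstar(int c, int quoted, const char *re, const char *text)
{
    for (;;) {
        if (matchhere(re, text))
            return 1;
        if (!matchone(c, quoted, text))
            return 0;
        text++;
    }
}

/* matchhere: search for re at beginning of text */
static int matchhere(const char *re, const char *text)
{
    int c, quoted = 0;

    if (re[0] == '\0')
        return 1;
    if (re[0] == '$' && re[1] == '\0')
        return *text == '\0';
    if (re[0] == '\\' && re[1] != '\0') {   /* \c: take c literally */
        quoted = 1;
        re++;
    }
    c = re[0];
    if (re[1] == '*')                       /* zero or more c */
        return matchstar(c, quoted, re+2, text);
    if (re[1] == '+')                       /* one c, then zero or more */
        return matchone(c, quoted, text)
            && matchstar(c, quoted, re+2, text+1);
    if (re[1] == '?')                       /* one c, or none at all */
        return (matchone(c, quoted, text) && matchhere(re+2, text+1))
            || matchhere(re+2, text);
    if (matchone(c, quoted, text))
        return matchhere(re+1, text+1);
    return 0;
}

/* match: search for re anywhere in text */
int match(const char *re, const char *text)
{
    if (re[0] == '^')
        return matchhere(re+1, text);
    do {
        if (matchhere(re, text))
            return 1;
    } while (*text++ != '\0');
    return 0;
}
```

Treating + as one mandatory occurrence followed by the existing closure keeps the change small: no new loop is needed, only one extra test in matchhere. In C source the backslash itself must be doubled, so the pattern \$ is written "\\$".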

I've also used this code extensively to explore testing techniques. Regular expressions are rich enough that testing is far from trivial, but small enough that one can quickly write down a substantial collection of tests to be performed mechanically. For extensions like those just listed, I ask students to write a large number of tests in a compact language (yet another example of 'notation') and use those tests on their own code; naturally I use their tests on other students' code as well.

Conclusions

I was amazed by how compact and elegant this code was when Rob Pike first wrote it -- it was much smaller and more powerful than I had thought possible. In hindsight, one can see a number of reasons why the code is so small.

First, the features are well chosen to be the most useful and to give the most insight into implementation, without any frills. For example, the implementation of the anchored patterns ^ and $ requires only 3 or 4 lines, but shows how to handle special cases cleanly before handling the general cases uniformly. The closure operation * is a fundamental notion in regular expressions and provides the only way to handle patterns of unspecified lengths, so it has to be present. But it would add no insight to also provide + and ? so those are left as exercises.

Second, recursion is a win. This fundamental programming technique almost always leads to smaller, cleaner and more elegant code than the equivalent written with explicit loops, and that is the case here. The idea of peeling off one matching character from the front of the regular expression and from the text, then recursing for the rest, echoes the recursive structure of the traditional factorial or string length examples, but in a much more interesting and useful setting.

Rob has told me that the recursion was not so much an explicit design decision as a consequence of how he approached the problem: given a pattern and a text, write a function that looks for a match; that in turn needs a 'matchhere' function; and so on.

'I have pretty vivid memories of watching the code almost write itself this way. The only challenge was getting the edge conditions right to break the recursion. Put another way, the recursion is not only the implementation method, it's also a reflection of the thought process taken when writing the code, which is partly responsible for the code's simplicity. Most important, perhaps, I didn't have a design when I started, I just began to code and saw what developed. Suddenly, I was done.'

Third, this code really uses the underlying language to good effect. Pointers can be mis-used, of course, but here they are used to create compact expressions that naturally express the extraction of individual characters and advancing to the next character. The same effect can be achieved by array indexing or substrings, but in this code, pointers do a better job, especially when coupled with C idioms for auto-increment and implicit conversion of truth values.

I don't know of another piece of code that does so much in so few lines, while providing such a rich source of insight and further ideas.

Acknowledgments

Your name here...