Summary of the standard string libraries
Python's standard library offers a number of modules with additional string processing features.
string
: The string module contains constants that decompose the ASCII characters into letters, digits, whitespace, and so on. It contains the full definition of the formatter that is used by the str.format() method. We'll look at this in the next section. It also contains the Template class, which defines a string template into which values can be interpolated.
re
: The regular expression library allows us to define a pattern that can be used to parse input strings. We'll look at this in the next section.
difflib
: The difflib module is used to compare sequences of strings, typically from text files. There are a number of comparison algorithms available in this module.
textwrap
: We can use the textwrap module to format large blocks of text.
unicodedata
: The unicodedata module provides functions for determining what kind of Unicode character is present. Unicode Standard Annex 44 defines a collection of properties that apply to the Unicode characters. One commonly used property is the general category of a character; this includes simple codes like "Lu" for an uppercase letter or "Nd" for a decimal number. The general category codes also include "Sk", which is for non-letter modifier symbols.
stringprep
: This is an implementation of RFC 3454, which prepares Unicode text strings in order to support sensible string comparisons.
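A few of the modules summarized above can be tried in a handful of lines. The following is only a minimal sketch; the sample strings and the variable names (greeting, wrapped, categories) are illustrative choices, not part of any library.

```python
import string
import textwrap
import unicodedata

# Constants from the string module that partition the ASCII characters.
print(string.ascii_letters)
print(string.digits)

# Template interpolates named values into $-placeholders.
greeting = string.Template("Hello, $name").substitute(name="world")
print(greeting)

# textwrap.fill() reflows a long string to a maximum line width.
wrapped = textwrap.fill("The quick brown fox jumps over the lazy dog.", width=20)
print(wrapped)

# unicodedata.category() reports the general category of a single character.
categories = {ch: unicodedata.category(ch) for ch in "A7^"}
print(categories)
```

The category lookup should report "Lu" for the letter, "Nd" for the digit, and "Sk" for the circumflex accent.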
Using the re module to parse strings
Regular expressions give us a simple way to specify a set of related strings by describing the pattern they have in common. In formal terms, a regular expression defines the set of all strings that fit the pattern. In theory, matching would be a simple check to see whether a given string is a member of that set. Since the set of strings generated from a pattern could be infinite, this isn't how things work in practice.
When we use the re module, we generally do three things. Firstly, we specify the pattern string. Secondly, we compile the pattern into an object that efficiently determines if and where a given string matches the pattern. Finally, we repeatedly use the pattern object to efficiently match, search, or parse the given input strings.
As a concrete example, we need to process input which contains lines like this: Birth Date: 3/8/1987 or Birth Date: 1/18/59. Note that the number of digits in each date and the amount of whitespace is allowed to vary.
We may perform any of the following three common kinds of processing:
- A matching regular expression might be Birth Date:\s+\d+/\d+/\d+. The \s+ subexpression means one or more whitespace characters. The \d+ subexpression means one or more digits. A match pattern is usually designed to match the whole string.
- A searching regular expression might be \d+/\d+/\d+. This search pattern includes one or more digits, \d+, and literal punctuation, /. This expression describes a substring that can be found somewhere within the given string.
- A parsing pattern separates the various digit groups from the surrounding context. This is a slight modification to one of the previous examples to include (), which specifies what to capture. We might use (\d+)/(\d+)/(\d+) to show that the digit groups should be extracted for further processing.
We can accomplish these matching, searching, and parsing operations with the re module in Python.
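As a rough sketch of the three kinds of processing, the patterns listed above can be applied to one sample line. The variable names (line, matched, found, parsed) are illustrative only.

```python
import re

line = "Birth Date:  3/8/1987"

# Matching: the pattern describes the line from its beginning.
matched = re.match(r"Birth Date:\s+\d+/\d+/\d+", line)
print(matched is not None)

# Searching: the pattern may be found anywhere within the line.
found = re.search(r"\d+/\d+/\d+", line)
print(found.group())

# Parsing: the () groups capture the digit runs for further processing.
parsed = re.search(r"(\d+)/(\d+)/(\d+)", line)
print(parsed.groups())
```

Note that the captured groups are strings; converting them to integers is a separate step.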
Using regular expressions
The general recipe for using regular expressions in a Python program has three essential steps. Of course, we must use import re to include the required module. The three steps are:
- Define the pattern string. This will almost always be a raw string, starting with r", because the regular expression string will be full of \ characters that we don't want to be treated as escapes by Python. Because \ begins a Python language escape, if we want to write a standalone \ character, we have to double it in a non-raw string. It is better to use a raw string and write r"\d+/\d+/\d+" than "\\d+/\\d+/\\d+".
- Evaluate the re.compile() function to create a pattern object. The resulting pattern object will do the real work of matching a given target string against the regular expression. We can combine the pattern and the compile in one statement like this:
>>> date_pattern = re.compile(r"Birth Date:\s+(.*)")
- Use the compiled pattern object to match or search the candidate strings. The result of a successful match or search will be a Match object. We can then use the match object, where necessary, to extract fields. For example:
>>> match = date_pattern.match("Should Not Match")
>>> match
>>> match = date_pattern.match("Birth Date: 3/8/87")
>>> match
<_sre.SRE_Match object at 0X82e60>
In the first example, the date_pattern.match() expression returned None because the given string didn't match the regular expression. In the second example, the given string did match the regular expression pattern, and a Match object was created. If our regular expression is used for parsing, we'll interrogate the Match object to get the various substrings.
When we have a Match object, it can have captured substrings that match parts of the overall pattern. We'll usually make use of the various group() methods to get substrings. Here are some examples:
>>> match.group()
'Birth Date: 3/8/87'
>>> match.group(1)
'3/8/87'
>>> match.groups()
('3/8/87',)
In the first example, we saw all of the matching content. In the second example, we saw the value of group number one, the first portion of the regular expression wrapped in (). In the final example, we saw all ()-wrapped groups in the regular expression. Since there was only one such group, the value of groups() is a single-item tuple with the matching text.
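The three-step recipe can be gathered into one small script. This is a sketch only: extract_birth_date() is a hypothetical helper name introduced here for illustration, and it reuses the date_pattern from the steps above.

```python
import re

# A raw string and the equivalent doubled-backslash string are the same value.
assert r"\d+/\d+/\d+" == "\\d+/\\d+/\\d+"

# Steps 1 and 2: define the pattern string and compile it once.
date_pattern = re.compile(r"Birth Date:\s+(.*)")


def extract_birth_date(line):
    """Hypothetical helper: return the captured date text, or None."""
    # Step 3: use the compiled pattern; match() returns None on failure.
    match = date_pattern.match(line)
    if match is None:
        return None
    return match.group(1)


for candidate in ["Should Not Match", "Birth Date: 3/8/87"]:
    print(repr(candidate), "->", repr(extract_birth_date(candidate)))
```

Checking for None before calling group() matters, because a failed match has no groups to interrogate.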
Creating a regular expression string
There are numerous rules for creating regular expression patterns, and we'll look at a few of them here. The definitive list is in the Python Standard Library documentation for the re module, in section 6.2.1. For more information on this topic, see Mastering Python Regular Expressions from Packt Books at https://www.packtpub.com/application-development/mastering-python-regular-expressions.
First we'll look at the "atomic" regular expressions. Then we'll look at the rules for combining regular expressions into a larger regular expression. Here are some simple, atomic regular expressions:
- Any single character. With a few exceptions, this means just about any printable character. The exceptions are the characters which have special meaning in the regular expression language, including ., *, ?, (, ), [, ], and |, among others.
- A . matches any character except a newline (unless the re.DOTALL flag is used). To match a period, the \ escape character is used: \. matches a period.
- Some escape sequences match whole classes of characters. \d matches any digit. \D matches any non-digit character. \s matches any whitespace character. \S matches any non-whitespace character. \w matches any word character. \W matches any non-word character. By default, these follow the Unicode rules. We can override this to follow a considerably simpler set of ASCII-only rules.
There are some suffixes that we can put after a regular expression.
- A * suffix means the previous expression can be matched zero or more times. This has the effect of making the previous RE pattern optional as well as eligible for repetition.
- A + suffix means the previous expression can be matched one or more times. This means that the previous pattern is mandatory and can also be repeated.
- A ? suffix means the previous expression is optional; it can be matched zero times or just one time.
- To actually match a suffix character, use the \ escape. For example, \* matches an asterisk.
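The atomic expressions and suffixes above can be checked against short sample strings. The following is a small sketch; matches() is a hypothetical helper defined here, built on re.fullmatch() so that the whole string has to fit the pattern.

```python
import re


def matches(pattern, text):
    """Hypothetical helper: True if the whole of text fits pattern."""
    return re.fullmatch(pattern, text) is not None


# + demands at least one digit; * also accepts the empty string.
print(matches(r"\d+", "1987"), matches(r"\d+", ""), matches(r"\d*", ""))

# ? makes the preceding expression optional.
print(matches(r"-?\d+", "-42"), matches(r"-?\d+", "42"))

# An unescaped . matches any character; \. matches only a period.
print(matches(r"3.14", "3x14"), matches(r"3\.14", "3x14"))

# \* matches a literal asterisk instead of acting as a suffix.
print(matches(r"\d+\*\d+", "6*7"))
```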
We can combine individual expressions into larger patterns. Here are some common techniques for doing this:
- A sequence of regular expressions is a regular expression. We simply put the expressions one after another inside the pattern string. When we write an expression like Birth, it's a sequence of five atomic expressions, each of which matches an individual character.
- A sequence of characters in [] matches any one of the given characters. This is generally used with single-character expressions; often we'll see constructs like [a-zA-Z0-9_] to match any letter or digit or _. To match multiple-character strings we use a suffix after the []. We can use r"[0-9a-fA-F]+" to match one or more hexadecimal digits. To make - one of the alternative characters, it must be first or last within the list of characters inside the [].
- Two regular expressions separated by | is a regular expression. Either one can match. We might be looking at a pattern like true|false. We must match one of the two regular expressions: either true or false. To match the pipe character, |, it must be escaped like this: \|.
- A regular expression surrounded by () is a regular expression. It's also preserved as a group, so that we can use the matching characters while parsing. To match parentheses, they must be escaped: \( matches a (. Substrings captured via () are available via the group() method of the match object.
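The combining techniques above can be illustrated with a few compiled patterns. This is a sketch under the assumption that whole-string matching with fullmatch() is what we want; the pattern names are illustrative.

```python
import re

# A character class with a + suffix: one or more hexadecimal digits.
hex_pattern = re.compile(r"[0-9a-fA-F]+")
print(hex_pattern.fullmatch("7fA0") is not None)
print(hex_pattern.fullmatch("7fG0") is not None)

# A - placed last inside [] is an ordinary character, not a range.
dashed_word = re.compile(r"[a-z-]+")
print(dashed_word.fullmatch("well-known") is not None)

# Alternation: either side of | may match.
bool_pattern = re.compile(r"true|false")
print([bool_pattern.fullmatch(t) is not None for t in ("true", "false", "maybe")])

# Escaped parentheses match literally; unescaped ones capture a group.
paren_pattern = re.compile(r"\((\d+)\)")
print(paren_pattern.fullmatch("(42)").group(1))
```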
These rules help us examine the details of a specific pattern. Here's a pattern we might use to parse some input:
r"(\w+)\s*[=:]\s*(.*)"
This is a regular expression which is a sequence of five regular expressions.
- The characters (\w+) make a regular expression, \w, with a + suffix enclosed in (). This matches any sequence of one or more word characters.
- \s* is a regular expression. It's a simple expression \s with a suffix of *. It matches zero or more whitespace characters. This means that spaces are optional after the initial word. If spaces are present, any number may be used.
- [=:] is a regular expression built from two single-character expressions, = and :. It matches either one of the two characters.
- \s* is used a second time to permit any number of whitespace characters between the = or : and the value.
- The final regular expression is (.*) which matches any sequence of characters.
When we use this regular expression, if a Match object is created, it will have two groups. We can then extract the name and value matched by the patterns within this regular expression.
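To see the two groups in action, the pattern can be applied to a few sample lines. This is only a sketch: parse_setting() and the sample lines are hypothetical, introduced here for illustration.

```python
import re

setting_pattern = re.compile(r"(\w+)\s*[=:]\s*(.*)")


def parse_setting(line):
    """Hypothetical helper: return (name, value), or None if no match."""
    match = setting_pattern.match(line)
    if match is None:
        return None
    # Group 1 is the (\w+) name; group 2 is the (.*) value.
    return match.group(1), match.group(2)


for line in ["size = 42", "name:Python Essentials", "no separator here"]:
    print(repr(line), "->", parse_setting(line))
```

The last line yields None because no = or : follows the leading word.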
Working with Unicode, ASCII, and bytes
The re module works with bytes as well as Unicode strings. We must provide proper pattern literals depending on which kind of string we're working with. With Unicode, we use pattern literals with the r prefix: r"\w+". With bytes, we use the rb prefix, rb"\w+"; the rb means raw bytes instead of raw Unicode characters.
The rules for the character classes are, of course, different. A Unicode string that matches the "\w+" pattern can have any of a wide variety of Unicode "word" characters. A bytes object that uses the "\w+" pattern will match ASCII characters from the set a-z, A-Z, 0-9, and _.
Tip
We must explicitly use bytes for the pattern literals when parsing, searching, or matching with bytes.
We can use an option in the re.compile() function to force a Unicode pattern to follow the simplified ASCII rules. If we write re.compile(r"\w+", re.ASCII), we've replaced the default Unicode rule for \w with the ASCII rule for \w, even though we're doing Unicode string matching.
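The difference between the Unicode rules, the ASCII rules, and bytes patterns can be seen with one sample string. This is a brief sketch; the sample text and variable names are illustrative choices.

```python
import re

text = "naïve café"

# Unicode rules (the default for str patterns): accented letters are word characters.
unicode_words = re.findall(r"\w+", text)
print(unicode_words)

# re.ASCII restricts \w to a-z, A-Z, 0-9, and _ while still matching a str.
ascii_words = re.findall(r"\w+", text, re.ASCII)
print(ascii_words)

# A bytes pattern must be used with bytes input; it follows the ASCII rules.
data = text.encode("utf-8")
byte_words = re.findall(rb"\w+", data)
print(byte_words)

# Mixing a str pattern with bytes input raises TypeError.
try:
    re.findall(r"\w+", data)
except TypeError as error:
    print("TypeError:", error)
```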
Using the locale module for personalization
When looking at the str.format() method, we saw that the n format type produced a number with formatting based on the user's locale. This means that the formatting varies according to the OS locale settings. Users in different countries will see that their personal locale settings are used properly.
Here's an example of using the locale module to get locale-specific formatting:
>>> import locale
>>> locale.setlocale(locale.LC_ALL,'')
'en_US.UTF-8'
>>> "{0:n}".format(23.456)
'23.456'
>>> locale.setlocale(locale.LC_ALL,'sv_SE')
'sv_SE'
>>> "{0:n}".format(23.456)
'23,456'
This script used the locale module to set the Python locale to match the prevailing OS locale. The locale is reported to be English as used in the US (en_US) and the preferred Unicode encoding is shown as UTF-8.
The formatted value of 23.456 showed up with a US English decimal point. This fits the expectations of users in the US.
We then switched the locale to Sweden. The locale was reported as sv_SE, which means the Swedish language, as used in Sweden. The formatted value switched to 23,456 with a decimal comma, which is appropriate for users in Sweden.
Let's continue this example, and use the locale.currency() formatting function:
>>> locale.currency(23.54)
'23,54 kr'
The amount was formatted using , for the decimal separator and kr as the local currency in Sweden. The locale module includes the currency names.
Note that we provided the numeric value, 23.54, in Python syntax, which does not vary by locale. Python floating-point literals always use decimal points. Only the output string from the currency() function uses the , character as a decimal separator.
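Locale names and their availability depend on the operating system, so a script may want to guard the setlocale() call. The following is a sketch under several assumptions: format_in_locale() is a hypothetical helper, the name "sv_SE.UTF-8" may not be installed on a given machine, and the helper changes only LC_NUMERIC (rather than LC_ALL, as in the session above) so the previous setting is simple to restore.

```python
import locale


def format_in_locale(value, locale_name):
    """Hypothetical helper: format value with the "n" type under locale_name.

    Returns None when the operating system does not provide the locale.
    """
    saved = locale.setlocale(locale.LC_NUMERIC)  # query the current setting
    try:
        locale.setlocale(locale.LC_NUMERIC, locale_name)
    except locale.Error:
        return None
    try:
        return "{0:n}".format(value)
    finally:
        locale.setlocale(locale.LC_NUMERIC, saved)  # restore the old setting


# The "C" locale is always available and uses a decimal point.
print(format_in_locale(23.456, "C"))

# This prints None if the Swedish locale is not installed.
print(format_in_locale(23.456, "sv_SE.UTF-8"))
```

Because setlocale() changes process-wide state, restoring the previous setting keeps the helper from affecting other formatting in the same program.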