Python Essentials

Splitting, partitioning, and joining strings

In Chapter 2, Simple Data Types, we looked at different processing methods for a string object. We can transform a string into a new string, create strings from non-string data, access a string to determine properties or locations within the string, and parse a string to decompose it.

In many cases, we need to extract elements of a string. The split() method is used to locate repeating list-like structures within a string. The partition() method is used to separate the head and tail of a string.

For example, given a string of the form "numerator=355,denominator=115" we can use these two methods to locate the various names and values. Here's how we can decompose this complex string into pieces:

>>> text = "numerator=355,denominator=115"
>>> text.split(",")
['numerator=355', 'denominator=115']
>>> items = _
>>> items[0].partition("=")
('numerator', '=', '355')
>>> items[1].partition("=")
('denominator', '=', '115')

We've used the split(",") method to break the longer string at each , character, creating a list object that has two substrings. The REPL automatically assigns the result of each expression (other than None) to a variable named _. We assigned the object to the items variable because the value of _ gets overwritten by the next expression statement.

We used the partition("=") method on each item in the items variable to break the assignment down into name, =, and value. A more complex application would probably perform more complex processing on the names and values.
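The two steps shown above can be combined into one small helper. This is only a minimal sketch: the function name parse_pairs and its parameter names are hypothetical, and it assumes that each name appears only once in the text.

```python
def parse_pairs(text, item_sep=",", pair_sep="="):
    """Split text into items, then partition each item into a name and a value."""
    result = {}
    for item in text.split(item_sep):
        # partition() always returns three strings: head, separator, tail.
        # If pair_sep is missing, the tail is simply an empty string.
        name, _, value = item.partition(pair_sep)
        result[name] = value
    return result


pairs = parse_pairs("numerator=355,denominator=115")
print(pairs)  # {'numerator': '355', 'denominator': '115'}
```

The values are still strings at this point; converting them with int() would be a separate step.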

The join() method is the inverse of the split() method. This works with a sequence of string objects to create a single long string from many smaller strings. Here's an example of using a tuple of strings to create a longer string:

>>> options = ("x", "y", "z")
>>> "|".join(options)
'x|y|z'

We've created a sequence of three strings and assigned it to a variable named options. We then used the string "|" to join the items in the options sequence. The result is a longer string with the items separated by the given string.

The split() and join() methods work well with singletons. If we try to split a single item with no punctuation, we get a sequence with a single item. If we join a singleton item, the separator will not be used.
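The singleton behavior described above can be checked with a short sketch. The variable names here are illustrative only.

```python
# Splitting a string that has no separator yields a one-item list.
single = "numerator".split(",")
print(single)  # ['numerator']

# Joining a one-item sequence returns that item; the separator never appears.
joined = "|".join(single)
print(joined)  # numerator

# Splitting and then joining with the same separator restores the original text.
text = "x|y|z"
round_trip = "|".join(text.split("|"))
print(round_trip == text)  # True
```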

Python's string methods give us the tools to handle a variety of string parsing and decomposition tasks. For a more general solution, we'll have to resort to even more powerful tools. We'll look at the regular expression module, re, later.

If we want to create complex strings, we use the format() method. We'll look at this next.

Using the format() method to make more readable output

Sophisticated string creation can be done with the format() method. We create a template string and values which can be plugged into the template. Here's an example of how this works:

>>> c = 42
>>> "{0:d}°C is {1:.1f}°F".format(c, 32+9*c/5)
'42°C is 107.6°F'

We've created a variable, c, with a value of 42. We've used a template, "{0:d}°C is {1:.1f}°F", to format two values. The argument value with an index of 0 is c, the argument value with an index of 1 is the value of the expression 32+9*c/5.

The template string includes literal characters, plus replacement fields. Each replacement field is surrounded by {}. The replacement field has two components with a syntax of {index:specification}. The index component identifies which item is taken from the positional arguments to the format() method. The specification component shows us how to format the selected object.

The example gives two specifications. One specification is the character d, which is the decimal integer conversion. The other is the slightly more complex .1f, which is a floating-point conversion with one digit to the right of the decimal point.
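Because each replacement field names its argument by index, the fields don't have to appear in argument order, and one argument can be formatted several times. The following sketch illustrates this with the same Celsius value; the variable names f, swapped, and repeated are chosen just for this illustration.

```python
c = 42
f = 32 + 9 * c / 5

# Positional indexes pick arguments by position, so they can appear in any order.
swapped = "{1:.1f}°F is {0:d}°C".format(c, f)
print(swapped)  # 107.6°F is 42°C

# The same index can be used more than once with different specifications.
repeated = "{0:d} decimal, {0:x} hex, {0:b} binary".format(c)
print(repeated)  # 42 decimal, 2a hex, 101010 binary
```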

There is considerable sophistication available in the format specifications. There are eight fields to a format specification. A summary of the syntax looks like this:

[[fill]align][sign][#][0][width][,][.precision][type]

We've surrounded each field with [] to group the names visually. Note that all the fields are actually optional and have default values.

We'll summarize the fields from right to left in order of importance.

  • Type: This specifies the overall type of conversion. Depending on the kind of Python object, there are a number of type codes available:
    • For string values, the type code of s is used.
    • For integer values, type codes of d, n, b, o, x, or X can be used. These provide decimal, locale-aware numbers, binary, octal, or hexadecimal output.
    • For float values, type codes are e, E, f, F, g, G, n, or %. The e codes use exponent notation. The f codes show fixed-point values with no exponent. The g codes are called general and choose e or f, depending on the magnitude of the number. The n code is locale-aware, using the locale settings for floating-point presentation. The % code multiplies by 100 and appends the % symbol.
  • Precision: For the e, f, and % codes, the .precision value is the number of positions to the right of the decimal point. For the g codes, it's the total number of significant digits. For string values, it's the maximum number of characters to show. It isn't allowed with the integer type codes.
  • The , separator: If a , character is used, then a comma is inserted as the thousands separator, US-style. This isn't locale-aware, so it isn't affected by the OS settings or the Python locale module.
  • Width: If omitted, the number is formatted as wide as necessary. If provided, the number is filled out to this width. By default, the fill uses leading spaces, but this can be changed by providing values for the fill and align fields.
  • 0: This forces filling to the required width with leading zeroes. This is the same as a fill and align of 0=.
  • #: This is used with b, o, and x formatting to include a prefix of 0b, 0o, or 0x in front of the number.
  • Sign: By default, positive numbers have no sign and negative numbers have a leading -. This is the same as a sign field of -. Providing a sign field of + means that all signs are shown explicitly. Providing a sign field of a single space means that an extra space is included for positive numbers, assuring that positive and negative numbers will align in columns when printed using a fixed-width font.
  • Fill and align: This fills up the space to the value of the width field. If we provide align without a specific fill character, the default character is a space. We can't provide a fill character on its own, though. There are four codes we can use:
    • < or fill< will push the data to the left, and the filling will be on the right.
    • > or fill> will push the data to the right, the fill character will be used on the left.
    • ^ or fill^ will center the data, filling both left and right.
    • = or fill= will put the sign first, and the fill character will be used after the sign. This will make the signs more prominent in a column of numbers.
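The fields summarized above can be tried one at a time. The sketch below uses the built-in format() function, which applies a single specification to a single value in the same way that a {0:specification} replacement field does. The sample values and the names samples and results are illustrative only.

```python
# Each entry pairs a format specification with the value it is applied to.
samples = [
    (",d", 1234567),     # thousands separator
    (">12,d", 1234567),  # right-aligned in 12 columns
    ("08.2f", -42.5),    # zero fill after the sign, two decimal places
    ("#x", 255),         # hexadecimal with the 0x prefix
    ("+d", 42),          # explicit sign
    (" d", 42),          # space reserved for the sign of a positive number
    ("*<8d", 42),        # data pushed left, fill on the right
    ("*^8d", 42),        # centered
    ("*=+8d", 42),       # sign first, then fill
    (".1%", 0.256),      # percentage with one decimal place
    (".3s", "Python"),   # precision truncates a string
]

results = {spec: format(value, spec) for spec, value in samples}
for spec, text in results.items():
    print(repr(spec), "->", repr(text))
```

Printing the repr() of each result makes leading spaces and fill characters easy to see.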

Here's an example that uses a fairly complex format specification:

>>> from decimal import Decimal
>>> amount = Decimal("234.56")
>>> "Pay: ${0:*>10n} dollars".format(amount)
'Pay: $****234.56 dollars'

We've created an object, amount, with a Decimal value. We then used a format specification of *>10n on this number. This used leading * characters to fill out the number to 10 characters.
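As a variation on this example, the sketch below applies the other alignment codes to the same Decimal value. It swaps the locale-aware n type for .2f so that the result doesn't depend on locale settings; the names left, center, and signed are illustrative only.

```python
from decimal import Decimal

amount = Decimal("234.56")

left = "{0:*<10.2f}".format(amount)      # fill on the right
center = "{0:*^10.2f}".format(amount)    # fill on both sides
signed = "{0:*=10.2f}".format(-amount)   # fill between the sign and the digits

print(left)    # 234.56****
print(center)  # **234.56**
print(signed)  # -***234.56
```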