Bitten by Unicode


by Joseph Carboni

Parsing Dollar Figures

One product of mine takes reports that come in as a table that’s been exported to PDF, which means text extraction. For dollar figures I find a prefixed dollar symbol and convert the number following it into a float. If there’s a hyphen in addition to the dollar symbol, it’s negative.

Right?

Well, when I was running reports through, negative values were coming out the other end as positive values.

Let’s look at this value as an example. (You can copy this into your interpreter and try it yourself.)

value = "‐$2,520.80"

Initially, I was using this regex to convert the value, stripping everything except hyphens, dots, and the digits 0 to 9.

import re
converted_value = float(re.sub(r"[^-.0-9]", "", value))

Result: 2520.8

Positive? I found this confusing and thought that perhaps I didn’t understand what the hyphen in my regex was doing, so I tried taking the hyphen out of the regex, putting the negative case behind an if-statement and then hitting it with a multiplication of -1:

if value.startswith('-$'):
    converted_value = float(re.sub(r"[^.0-9]", "", value)) * -1

The result? The if-statement is never triggered!

Inspecting the hyphen

I was beginning to think I had taken the wrong crazy pills that morning. However, not long ago I read a chapter in Fluent Python by Luciano Ramalho (great book, by the way) about normalizing Unicode. So, I had been exposed recently to Unicode trickery.

I pulled in the standard library module unicodedata and started checking things. I fed the standard hyphen character from my keypress as well as the hyphen from value into unicodedata.name().

>>> import unicodedata
>>> unicodedata.name("-")
'HYPHEN-MINUS'

>>> unicodedata.name(value[0])
'HYPHEN'

Whoa whoa whoa, what’s this!? These aren’t the same character, and I couldn’t tell by just looking at it. What’s the difference between HYPHEN-MINUS and HYPHEN?

HYPHEN-MINUS (U+002D)

  • Intended Use: This character is meant to serve as a general-purpose hyphen, minus sign, or dash. It’s a legacy character inherited from ASCII.
  • Appearance: It often appears as a small horizontal line, but the exact rendering can vary depending on the font and context.
  • Usage: It’s commonly used in programming, file names, and where a quick, generic hyphen or minus sign is needed.

HYPHEN (U+2010)

  • Intended Use: This character is explicitly designated as a hyphen, used to join words or split a word across lines.
  • Appearance: Because it has only one job, a font can draw it as a proper hyphen rather than a compromise between a hyphen and a minus sign, though the exact shape still depends on the font.
  • Usage: It is used in text where a typographically correct hyphen is required, such as in professionally typeset documents.
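Since the two characters can look identical on screen, it helps to inspect them by code point rather than by eye. The snippet below is a minimal sketch of one way to do that; the variable names keyboard_hyphen and pdf_hyphen are made up for illustration, and the U+2010 character is written as an escape so it cannot be confused with the keyboard hyphen.

```python
import unicodedata

keyboard_hyphen = "-"       # HYPHEN-MINUS, typed from the keyboard
pdf_hyphen = "\u2010"       # HYPHEN, the character found at value[0]
value = pdf_hyphen + "$2,520.80"

# ascii() escapes every non-ASCII character, so look-alikes become visible.
print(ascii(value))  # '\u2010$2,520.80'

# Print the code point and official name of each character.
for char in (keyboard_hyphen, pdf_hyphen):
    print(f"U+{ord(char):04X}", unicodedata.name(char))
# U+002D HYPHEN-MINUS
# U+2010 HYPHEN

# The two characters never compare equal, which is why startswith('-$') fails.
print(keyboard_hyphen == pdf_hyphen)  # False
```

Calling ascii() on a suspicious string is a quick first check whenever text extracted from a PDF refuses to match a pattern that looks right.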

My Solution

I needed to do something to account for this and make sure to catch not only this variation of a hyphen, but any such variation. For that, I leveraged Unicode categories.

Hyphens and dashes share the Unicode general category ‘Pd’ (Punctuation, Dash).

def is_hyphen(char: str) -> bool:
    return unicodedata.category(char) == 'Pd'
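To get a feel for what this check accepts, here is a small sketch that runs it over a handful of dash-like characters. The list of candidates is my own selection for illustration, not something from the reports. One caveat worth knowing: MINUS SIGN (U+2212) belongs to category ‘Sm’ (Symbol, Math), so this check does not treat it as a hyphen.

```python
import unicodedata


def is_hyphen(char: str) -> bool:
    return unicodedata.category(char) == "Pd"


# A few look-alike characters, written as escapes so they stay distinguishable.
candidates = ["\u002d", "\u2010", "\u2013", "\u2014", "\u2212"]

for char in candidates:
    print(
        f"U+{ord(char):04X}",
        unicodedata.name(char),
        unicodedata.category(char),
        is_hyphen(char),
    )
# U+002D HYPHEN-MINUS Pd True
# U+2010 HYPHEN Pd True
# U+2013 EN DASH Pd True
# U+2014 EM DASH Pd True
# U+2212 MINUS SIGN Sm False
```

If your source documents might use a true minus sign for negative amounts, the check would need to allow U+2212 explicitly.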

And now I can correctly identify any preceding dash in my if-statement.

if is_hyphen(value[0]) and value[1] == "$":
    converted_value = float(re.sub(r"[^.0-9]", "", value)) * -1

Result: -2520.8
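Putting the pieces together, a single helper could handle both positive and negative figures. This is only a sketch under a few assumptions: the name parse_dollars is hypothetical, the sign is expected directly before the dollar symbol as in the example value, and MINUS SIGN (U+2212) is accepted as well even though it is in category ‘Sm’ rather than ‘Pd’, which goes beyond what my original check does.

```python
import re
import unicodedata

MINUS_SIGN = "\u2212"  # category Sm, so the Pd check alone would miss it


def parse_dollars(value: str) -> float:
    """Hypothetical helper: convert text such as '-$2,520.80' to a float."""
    text = value.strip()
    sign, symbol = text[:1], text[1:2]
    # Negative only when a dash-like character sits right before the "$".
    is_negative = symbol == "$" and (
        unicodedata.category(sign) == "Pd" or sign == MINUS_SIGN
    )
    # Keep digits and the decimal point; the sign is applied separately.
    amount = float(re.sub(r"[^.0-9]", "", text))
    return -amount if is_negative else amount


print(parse_dollars("\u2010$2,520.80"))  # -2520.8 (HYPHEN, as in the report)
print(parse_dollars("-$2,520.80"))       # -2520.8 (HYPHEN-MINUS)
print(parse_dollars("$2,520.80"))        # 2520.8
```

Handling the sign separately from the digits keeps the regex simple and means a new dash variant only has to satisfy the category check.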

ABOUT THE AUTHOR

Joseph Carboni is a multifaceted programmer with a background in bioinformatics, neuroscience, and sales, now focusing on Python development. He developed a ribosomal loading model and contributed to a neuroscience paper before transitioning to a six-year sales career, enhancing his understanding of business and client relations. Currently, he’s a Python Developer at Shupe, Carboni & Associates, improving business processes, and runs Carboni Technology for independent tech projects. Joseph welcomes collaborations and discussions via LinkedIn (Joseph Carboni), Twitter (@JoeCarboni1), or email (joe@carbonitech.com).


PUBLISH YOUR WRITINGS HERE!

We are always looking to publish your writings on the pyATL website. All content must be related to Python, non-commercial (no pitches), and comply with our code of conduct.
If you’re interested, reach out to the editors at hello@pyatl.dev