How does the downcast logic in pd.to_numeric decide when to downcast from Float64 to Float32?

I am trying to understand the logic behind pd.to_numeric float downcasting. I'm looking for the specific condition(s) that are used.

I was hoping / expecting that it would preserve the uniqueness of the values, but in my example, it does not.

The docs do not explain the logic for any of the downcasting options: float, integer and unsigned. I would love to understand them all.

import pandas as pd
import numpy as np
s = pd.Series(np.random.uniform(0, 1, 100_000), dtype="Float64")
s_float32 = pd.to_numeric(s, downcast="float")
print(s_float32.dtype)
print(s_float32.nunique() == s.nunique())

Output:

Float32
False

Answer by Corralien

You can read the source of pd.to_numeric to find the code responsible for downcasting.

The key is:

>>> np.typecodes["Float"]
'efdg'

From the documentation, letters stand for:

  • e (half): float16
  • f (single): float32
  • d (double): float64
  • g (longdouble): float128 on many platforms (the width of np.longdouble is platform-dependent and can also be 64 or 96 bits)

The algorithm tries each candidate dtype in order, from the smallest to the largest.
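To see which candidates each downcast option walks through, the typecode strings can be expanded into dtypes. This is a minimal sketch using only NumPy. The mapping from option name to typecode group ("integer" to np.typecodes["Integer"], "unsigned" to np.typecodes["UnsignedInteger"]) is my reading of the pandas source, and the `candidates` helper is hypothetical, not a pandas function.

```python
import numpy as np

# Assumed mapping from the downcast option to a NumPy typecode group.
GROUPS = {"integer": "Integer", "unsigned": "UnsignedInteger", "float": "Float"}

def candidates(downcast):
    """Hypothetical helper: candidate dtypes in the order they are tried."""
    codes = np.typecodes[GROUPS[downcast]]
    if downcast == "float":
        # float16 is dropped, so the search starts at float32
        codes = codes[codes.index(np.dtype(np.float32).char):]
    return [np.dtype(c) for c in codes]

for option in GROUPS:
    # Print each candidate's name and width in bytes (platform-dependent)
    print(option, [(dt.name, dt.itemsize) for dt in candidates(option)])
```

The printed widths can differ between platforms, most visibly for the longdouble entry.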

Update

"I do not see where the condition is being used to select a particular dtype"

And yet it's all there. Resolved step by step, the code looks like this:

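import numpy as np
from pandas.core.dtypes.cast import maybe_downcast_numeric
from pandas.core.arrays import FloatingArray

# L179
# s is the Float64 Series from the question
values = s.values  # extract the underlying array (a masked FloatingArray here)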
typecodes = np.typecodes["Float"]  # downcast='float'

# L202
mask = values._mask
values = values._data[~mask]  # convert pd.Float64 to np.float64

# L260
# remove float16 from possible dtypes
float_32_char = np.dtype(np.float32).char
float_32_ind = typecodes.index(float_32_char)
typecodes = typecodes[float_32_ind:]

# L264
for typecode in typecodes: # for f, d and g
    dtype = np.dtype(typecode)  # convert as dtype
    if dtype.itemsize <= values.dtype.itemsize:
        values = maybe_downcast_numeric(values, dtype)
# 1st iteration 'f': values are downcast to float32
# 2nd iteration 'd': skipped because float64 is wider than the float32 values
# 3rd iteration 'g': skipped for the same reason

# L284
data = np.zeros(mask.shape, dtype=values.dtype)
data[~mask] = values  # copy the downcast values back into the unmasked positions

# L300
klass = FloatingArray
values = klass(data, mask)  # convert np.float32 to pd.Float32

Output:

>>> values.dtype
Float32Dtype()
>>> len(values)
100000
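
The walkthrough above does not show what maybe_downcast_numeric checks before it accepts the narrower dtype. As far as I can tell from the pandas source, the float branch casts the values and keeps the result only if np.allclose reports the cast values as equal to the originals within an absolute tolerance, with no uniqueness check at all. The sketch below imitates that under stated assumptions: `try_float32` is a hypothetical stand-in, and the tolerance of 5e-4 for float32 is taken from my reading of the source and may differ in your installed version.

```python
import numpy as np

def try_float32(values, atol=5e-4):
    """Hypothetical stand-in for the float branch of maybe_downcast_numeric."""
    candidate = values.astype(np.float32)
    # Accept the cast only if every element stays within an absolute tolerance
    # (atol=5e-4 is an assumption, not a documented pandas parameter).
    if np.allclose(candidate, values, rtol=0.0, atol=atol, equal_nan=True):
        return candidate
    return values  # otherwise keep the original dtype

close_pair = np.array([0.1, 0.1 + 1e-10])  # two distinct float64 values
big = np.array([16777217.0])               # 2**24 + 1, not exactly representable in float32

down = try_float32(close_pair)
print(down.dtype, np.unique(close_pair).size, np.unique(down).size)
print(try_float32(big).dtype)
```

Under this check, two float64 values that differ by less than the float32 spacing collapse into one float32 value and the cast is still accepted, which would explain why nunique() changes in the question. A value whose float32 rounding error exceeds the tolerance keeps its float64 dtype.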