Add thousands separator to column in dataframe in julia

382 Views Asked by At

I have a dataframe with two columns a and b and at the moment both are looking like column a, but I want to add separators so that column b looks like below. I have tried using the package format.jl. But I haven't gotten the result I'm afte. Maybe worth mentioning is that both columns is Int64 and the column names a and b is of type symbol.

 a      |    b
150000  | 1500,00 
27      | 27,00
16614   | 166,14

Is there some other way to solve this than using format.jl? Or is format.jl the way to go?

3

There are 3 best solutions below

6
Sundar R On

(Edited to reflect the updated specs in the question)

julia> df = DataFrame(a = [150000, 27, 16614]);

julia> function insertdecimalcomma(n)
         if n < 100
           return string(n) * ",00"
         else
           return replace(string(n), r"(..)$" => s",\1")
         end
       end
insertdecimalcomma (generic function with 1 method)

julia> df.b = insertdecimalcomma.(df.a)

julia> df
3×2 DataFrame
 Row │ a       b       
     │ Int64   String  
─────┼─────────────────
   1 │ 150000  1500,00
   2 │     27  27,00
   3 │  16614  166,14

Note that the b column will necessarily be a String after this change, as integer types cannot store formatting information in them.

If you have a lot of data and find that you need better performance, you may also want to use the InlineStrings package:

julia> #same as before upto the function definition

julia> using InlineStrings

julia> df.b = inlinestrings(insertdecimalcomma.(df.a))
3-element Vector{String7}:
 "1500,00"
 "27,00"
 "166,14"

This stores the b column's data as fixed-size strings (String7 type here), which are generally treated like normal Strings, but can be significantly better for performance.

0
Nathan Boyer On

Assuming you want the commas in their typical positions rather than how you wrote them, this is one way:

julia> using DataFrames, Format

julia> f(x) = format(x, commas=true)
f (generic function with 1 method)

julia> df = DataFrame(a = [1000000, 200000, 30000])
3×1 DataFrame
 Row │ a       
     │ Int64
─────┼─────────
   1 │ 1000000
   2 │  200000
   3 │   30000

julia> transform(df, :a => ByRow(f) => :a_string)
3×2 DataFrame
 Row │ a        a_string  
     │ Int64    String
─────┼────────────────────
   1 │ 1000000  1,000,000
   2 │  200000  200,000
   3 │   30000  30,000

If you instead want the row replaced, use transform(df, :a => ByRow(f), renamecols=false). If you just want the output vector rather than changing the DataFrame, you can use format.(df.a, commas=true)

You could write your own function f to achieve the same behavior, but you might as well use the one someone already wrote inside the Format.jl package.

However, once you transform you data to Strings as above, you won't be able to filter/sort/analyze the numerical data in the DataFrame. I would suggest that you apply the formatting in the printing step (rather than modifying the DataFrame itself to contain strings) by using the PrettyTables package. This can format the entire DataFrame at once.

julia> using DataFrames, PrettyTables

julia> df = DataFrame(a = [1000000, 200000, 30000], b = [500, 6000, 70000])
3×2 DataFrame
 Row │ a        b     
     │ Int64    Int64 
─────┼────────────────
   1 │ 1000000    500
   2 │  200000   6000
   3 │   30000  70000

julia> pretty_table(df, formatters = ft_printf("%'d"))
┌───────────┬────────┐
│         a │      b │
│     Int64 │  Int64 │
├───────────┼────────┤
│ 1,000,000 │    500 │
│   200,000 │  6,000 │
│    30,000 │ 70,000 │
└───────────┴────────┘
0
user3680029 On

May be someone will search here for a space thousand separator I cooked up the following regex based transcription :

#first conversion of a float to string (w/o scientific notation)
str = @sprintf("%.2f",nbr) 
matches = eachmatch(r"\d{1,3}(?=(?:\d{3})+(?!\d))", str)
decimal_match = match(r"\d{3}\.\d+$", str)
... vcat the vector of matches with decimal part, then join with " "... 

You only need to deal with numbers up < 100.00 as the regex doesn't account for them (it's simplistic but do the job). I use this function as a formatter in a pretty print of a DataFrame.