SmarterCSV: ignore blank lines in CSV


I am using SmarterCSV and have encountered a CSV that has blank lines. Is there any way to ignore these? SmarterCSV is treating the blank line as a header and not processing the file correctly. Is there any way I can bastardize the comment_regexp?

mail.attachments.each do |attachment|
  filename = attachment.filename
  # filedata = attachment.decoded
  puts filename
  begin
    tmp = Tempfile.new(filename)
    tmp.write attachment.decoded
    tmp.close
    puts tmp.path
    f = File.open(tmp.path, "r:bom|utf-8")
    options = {
      :comment_regexp => /^#/
    }
    data = SmarterCSV.process(f, options)
    f.close
    puts data
  end
end

Sample file (attached to the question as an image, test.csv):

[image: test.csv]

Output:

[image: output screenshot]

1 Answer

Answer by Cary Swoveland:

Let's first construct your file.

str = <<~_
#
# Report
#---------------
Date              header1           header2  header3      header4
        20200 jdk;df           4543 $8333              4387       

        20200 jdk              5004 $945876              67

_

fin_name = 'in'
File.write(fin_name, str)
  #=> 223

Two problems must be addressed to read this file using the method SmarterCSV::process. The first is that comments--lines beginning with an octothorpe ('#')--and blank lines must be skipped. The second is that the field separator is not a fixed-length string.

The first of these problems can be dealt with by setting the value of process' :comment_regexp option key to a regular expression:

:comment_regexp => /\A#|\A\s*\z/

which reads, "match an octothorpe at the beginning of the string (\A being the beginning-of-string anchor) or (|) match a string containing zero or more whitespace characters (\s being a whitespace character and \z being the end-of-string anchor)".

Unfortunately, SmarterCSV is not capable of dealing with variable-length field separators. It does have an option :col_sep, but its value must be a string, not a regular expression.
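For contrast, if the separator were a fixed string (a single tab, say), :col_sep alone would do the job. A minimal sketch, with a hypothetical tab-separated file name:

SmarterCSV.process('data.tsv', :col_sep => "\t")  # 'data.tsv' is hypothetical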

We must therefore pre-process the file before using SmarterCSV, though that is not difficult. While we are at it, we may as well remove the dollar signs and use commas for field separators.1

fout_name = 'out.csv'

fout = File.new(fout_name, 'w')
File.foreach(fin_name) do |line|
  fout.puts(line.strip.gsub(/\s+\$?/, ',')) unless 
    line.match?(/\A#|\A\s*\z/)
end
fout.close
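
To see what the strip-and-gsub in that loop does, here it is applied to a single raw data line from the file constructed above:

line = "        20200 jdk;df           4543 $8333              4387       "
line.strip.gsub(/\s+\$?/, ',')
  #=> "20200,jdk;df,4543,8333,4387"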

Let's look at the file produced.

puts File.read(fout_name)

displays

Date,header1,header2,header3,header4
20200,jdk;df,4543,8333,4387
20200,jdk,5004,945876,67

Now that's what a CSV file should look like! We may now use SmarterCSV on this file with no options specified:

SmarterCSV.process(fout_name)
  #=> [{:date=>20200, :header1=>"jdk;df", :header2=>4543,
  #     :header3=>8333, :header4=>4387},
  #    {:date=>20200, :header1=>"jdk", :header2=>5004,
  #     :header3=>945876, :header4=>67}]

1. I used IO::foreach to read the file line-by-line and then write each manipulated line that is neither a comment nor a blank line to the output file. If the file is not huge we could instead gulp it into a string, modify the string and then write the resulting string to the output file: File.write(fout_name, File.read(fin_name).gsub(/^#.*?\n|^[ \t]*\n|^[ \t]+|[ \t]+$|\$/, '').gsub(/[ \t]+/, ',')). The first regular expression reads, "match lines beginning with an octothorpe, or lines containing only spaces and tabs, or spaces and tabs at the beginning of a line, or spaces and tabs at the end of a line, or a dollar sign". The second gsub converts each run of spaces and tabs to a comma.
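That one-liner from the footnote, laid out over several lines for readability (it assumes the file fits comfortably in memory):

File.write(
  fout_name,
  File.read(fin_name)
      .gsub(/^#.*?\n|^[ \t]*\n|^[ \t]+|[ \t]+$|\$/, '') # drop comments, blank lines, edge whitespace and '$'
      .gsub(/[ \t]+/, ',')                              # turn runs of spaces/tabs into commas
)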
