I am reading a flat file (fixed-width fields) so that I can import it into SQL Server. This is a new project in my company, and I have not yet worked on parsing huge files in a time-efficient way. Scouring the Internet, I found a PowerShell script that can parse a fixed-width file, and it does the job:
# Load the VB assembly that provides TextFieldParser
[void][System.Reflection.Assembly]::LoadWithPartialName('Microsoft.VisualBasic')

$Parser = New-Object Microsoft.VisualBasic.FileIO.TextFieldParser('C:\T1046436.dat')
$Parser.TextFieldType  = 'FixedWidth'
$Parser.TrimWhiteSpace = $False
$Parser.FieldWidths    = @(13,6,11,2,10,10,1,1,7,7,2,17,1,2,2,4,1,10,10,10,10,2,10,10,11,2,2,1,1)

while (!$Parser.EndOfData)
{
    try
    {
        # ReadFields returns one record as a string array
        Write-Host $Parser.ReadFields()
    }
    catch [Microsoft.VisualBasic.FileIO.MalformedLineException]
    {
        Write-Host "Error, line $($_.Exception.LineNumber): $($Parser.ErrorLine)"
    }
}
I want to save the output to a pipe-delimited file so that I can simply BCP it into the SQL database; that's my quick-and-dirty way to load data. But it is taking a lot of time (20 minutes for 50,000 records).
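For reference, the BCP load itself is just a one-liner along these lines (database, table, and server names are placeholders here):

bcp MyDatabase.dbo.MyStagingTable in C:\T1046436.psv -c -t "|" -S MyServer -T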
Any suggestions for getting the data quickly and efficiently:
- converted into a pipe-delimited file, OR
- imported directly into SQL Server from PowerShell, OR
- parsed with any other scripting language that can quickly turn the flat file into a pipe-delimited file under Windows (Python, Perl, etc.). Any example scripts will be appreciated.
This PowerShell script will parse a collection of $files containing records of 5 fields, each 10 characters wide, and output those records with the fields pipe-delimited.
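In sketch form it looks something like this (the input and output paths are placeholders, and the regex assumes five 10-character fields):

$files = (Get-ChildItem 'C:\data\*.dat').FullName    # placeholder path; point this at your input files

filter PipeDelimit {
    # $_ arrives as an array of up to 1000 lines; -replace delimits the whole array in one operation
    $_ -replace '(.{10})(.{10})(.{10})(.{10})(.{10})', '$1|$2|$3|$4|$5'
}

Get-Content $files -ReadCount 1000 | PipeDelimit | Set-Content 'C:\data\out.psv'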
You'll need to modify the regex and filter to match your data. I suspect it will take considerably less than 20 minutes to chew through half a million of those records.
The -ReadCount parameter will control memory usage by keeping only 1000 records at a time in the pipeline. They will be passed down the pipeline as an array, and the -replace operator in the filter will delimit the entire array in one operation, without needing to foreach through each record. The filter is admittedly unusual, and could be replaced with ForEach-Object, but the filter is marginally faster, and it adds up if you're doing lots of reps.
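For comparison, the ForEach-Object version of the same pipeline would look something like this (same placeholder paths and regex as above):

Get-Content $files -ReadCount 1000 |
    ForEach-Object { $_ -replace '(.{10})(.{10})(.{10})(.{10})(.{10})', '$1|$2|$3|$4|$5' } |
    Set-Content 'C:\data\out.psv'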