Shuffle Text File Delphi Source or anything else

1.4k Views Asked by At

I have a stringlist with 10,000 entries. I have a shuffle routine, but accessing any of the items is taking a lot of time. Going through all 10k items takes a huge amount of time.

I want to save it do disk and then do a shuffle on the file using another method.

Any suggestions?

4

There are 4 best solutions below

3
On BEST ANSWER

How is your shuffle-routine implemented? Especially the exchange-routine? If you have written your own, along these lines:

vTempSrting := vStringList[I]; 
vStringList.Delete(I); 
vStringList.Insert(J,vTempString);

it will be very slow. Use the exchange-method on the stringlist.

This code took 78 ms on my pretty average (3 year old) computer:

program Project1;

{$APPTYPE CONSOLE}

uses
  SysUtils,Classes,uIntegerList,Windows,Math;

procedure Shuffle(aSL : TStringList);
var I,J : integer;
begin
  for I := 0 to aSL.Count-1 do
  begin
    J := randomrange(I,aSL.Count);
    aSL.Exchange(I,J);
  end;
end;

procedure CreateTestFile;
var
  vSL : TStringList;
  I : integer;
begin
  vSL := TStringList.Create;
  try
    for I := 1 to 100000 do vSL.Add('Sample text #'+inttostr(I));
    vSL.SaveToFile('c:\test.txt');
  finally
    vSL.Free;
  end;
end;

function TestShuffle : longword;
var
  vSL : TStringList;
  vTick0 : longword;
begin
  vSL := TStringList.Create;
  try
    vTick0 := gettickcount;
    vSL.LoadFromFile('c:\test.txt');
    Shuffle(vSL);
    vSL.SaveToFile('c:\test.txt');
    Result := gettickcount - vTick0;
  finally
    vSL.Free;
  end;
end;

begin
  CreateTestFile;
  writeln(TestShuffle,' ms');
  readln;
end.
1
On

I asked a question before about creating a shuffled range - rather than generating a list of numbers and then shuffling them, I wanted a function that was able to iteratively return a list of shuffled numbers, without the O(n) memory cost:

Generating shuffled range using a PRNG rather than shuffling

If you create some kind of index for your file on disk, then you can create a shuffled version without paying the memory cost, which can be important for very large files. For an index, I suggest something simple, like a flat stream of the positions (as 32 or 64-bit integers) of every line start. That way, to extract the Nth line out of the text file, you can simply seek in the index stream to N*4 (or N*8 for 64-bit indexes) to discover the offset of the line start, and then seek to that position in the text file stream and read out a line.

Using this approach, you can shuffle extremely large files, without paying the in-memory cost. Of course, shuffling will mean extracting lines at random from the source file, which will not be as efficient as in-memory sorting unless the file is very small (fits in cache almost on first access) or very large (in which case memory thrashing will be worse than random seeks), or perhaps if you're not using a mechanical hard drive (e.g. SSD).

For your situation, 10K is really not a large number. Something in the region of 10 million lines, perhaps getting into several gigabytes of text (depending on line length of course), will be far more challenging, and that's where this approach (or something similar) would be necessary in 32-bit.

1
On

Rearranging a stringlist in memory is slow, so I'd shuffle an index list as an initial optimization.

I'm guessing you chose stringlist for the convenience of loading from and saving to disk. One quicker approach would be to shuffle an index. Make an array of 10,000 integers, shuffle those, then use a temporary string variable to hold the swap element and rearrange your stringlist from top to bottom using the shuffled index values.

Major rewrites will provide greater improvements, but this may help if your strings aren't too big.

0
On

An easy way is to generate a list of random numbers, sort it, and then do pairwise swaps of data later. Sorting can be done as an o(n*log(n)) algorithm, whereas swapping always is an o(n) algorithm, thus much faster.

Just in case you haven't thought of it, consider to leave the data as it is, and just save an extra shuffled index.