I happened to look at the algorithmic details of UNIX Sort, a LINUX version of the classic UNIX sort is a part of GNU coreutils-6.9.90. This is classic example of the standard External R-Way merge , to sort a data of size N bytes with a main memory size of M so it creates N/M runs and merges R at a time, the number of passes through the data is log(N/M)/log(R) passes.In fact the lower bound(runtime) for external sorting is Ω((N/M)log(N/M)/log(R)). All the external memory sorting algorithms provided in the literature are optimal so the fight here is minimizing the constant before the number of passes.
UNIX sort treats keys are lines (strings), the algorithm followed by
unix sort is in fact the R-Way merge. Let the input file size be
IN_SIZE.
1. Choosing Run Size:
--------------------------------
The sizes of the initial runs are chosen from the total physical
memory (TOTAL_PHY) and available memory (AVAIL_PHY).
RUN_SIZE = (MAX(TOTAL_PHY/8,AVAIL_PHY))/2
maximum of 1/8th of TOTAL_PHY and AVAIL_PHY and divided by 2.
See function "default_sort_size (void)" in the code.
2. Creating Runs:
-------------------------
Unix sort creates a temporary file for every run. So it creates
IN_SIZE/RUN_SIZE (celing) temporary files. Internally it uses merge
sort to sort internally it uses an optimization mentioned in Knuth
volume 3 (2nd edition), problem 5.2.4-23.
3. Merging:
----------------
The number of runs merged at any time is hard coded in the program
see macro NMERGE , NMERGE is defined to be 16 so it merges
exactly 16 runs at any time.

4 comments:
Can anyone recommend the robust Script Deployment system for a small IT service company like mine? Does anyone use Kaseya.com or GFI.com? How do they compare to these guys I found recently: N-able N-central MSP solution
? What is your best take in cost vs performance among those three? I need a good advice please... Thanks in advance!
I just discovered the website who reviews about
Several
home based business opportunity
If you want to know more here it is
home based business reviews
www.home-businessreviews.com
My friend and I were recently talking about the prevalence of technology in our day to day lives. Reading this post makes me think back to that discussion we had, and just how inseparable from electronics we have all become.
I don't mean this in a bad way, of course! Societal concerns aside... I just hope that as memory becomes less expensive, the possibility of transferring our brains onto a digital medium becomes a true reality. It's one of the things I really wish I could encounter in my lifetime.
(Posted on Nintendo DS running [url=http://knol.google.com/k/anonymous/-/9v7ff0hnkzef/1]R4i SDHC[/url] DS SurfV3)
I'm wondering if execution time in some scenarios couldn't benefit from using different algorithms. For example, when using a sort command within a long pipe, a selection sort algorithm might be better because, despite being asimptotically worse, it starts outputting results sooner than merge sort, thus reducing the "stall" added by the sort.
The drawback would be memory consumption and perhaps efficiency when using very large files (although if that's a problem would depend on the rest of the pipeline), so it shouldn't be thought as a general solution. But perhaps a command-line option for using a "low-latency" algorithm, or even an automatic switch when in and out were stdin and stdout, would be handy.
Post a Comment