《數(shù)據(jù)庫系統(tǒng)》教學(xué)課件
《數(shù)據(jù)庫系統(tǒng)》教學(xué)課件,數(shù)據(jù)庫系統(tǒng),數(shù)據(jù)庫,系統(tǒng),教學(xué),課件
External SortingJianlin FengSchool of SoftwareSUN YAT-SEN UNIVERSITYcourtesy of Joe Hellerstein for some slidesWhy Sort?nA classic problem in computer science!nData requested in sorted order qe.g.,find students in increasing gpa ordernFirst step in bulk loading B+tree index.nUseful for eliminating duplicates(Why?)nUseful for summarizing groups of tuplesnSort-merge join algorithm involves sorting.nProblem:sort 100Gb of data with 1Gb of RAM.qwhy not virtual memory?2-Way Sort:Requires 3 BuffersnPass 0:Read a page,sort it,write it.qonly one buffer page is used.qeach sorted page(or subfiles)is called a run.nPass 1,2,3,etc.:qrequires 3 buffer pagesqmerge pairs of runs into runs twice as longqthree buffer pages used.Main memory buffersINPUT 1INPUT 2OUTPUTDiskDiskMerging RunsTwo-Way External Merge SortnEach pass we read+write each page in file.nN pages in the file=the number of passesnSo total cost is:nIdea:Divide and conquer:sort subfiles and mergeInput file1-page runs2-page runs4-page runs8-page runsPASS 0PASS 1PASS 2PASS 393,46,29,48,75,63,123,45,62,64,97,81,322,34,64,78,91,35,622,34,46,78,91,23,561,22,33,44,56,67,8Merging RunsGeneral External Merge SortnTo sort a file with N pages using B buffer pages:qPass 0:use B buffer pages.Produce sorted runs of B pages each.qPass 1,2,etc.:merge B-1 runs.B Main memory buffersINPUT 1INPUT B-1OUTPUTDiskDiskINPUT 2.*More than 3 buffer pages.How can we utilize them?Cost of External Merge SortnNumber of passes:nCost=2N*(#of passes)nE.g.,with 5 buffer pages,to sort 108 page file:qPass 0:=22 sorted runs of 5 pages each(last run is only 3 pages)qPass 1:=6 sorted runs of 20 pages each(last run is only 8 pages)qPass 2:2 sorted runs,80 pages and 28 pagesqPass 3:Sorted file of 108 pagesFormula check:log4 22=3 +1 4 passes Number of Passes of External Sort(I/O cost is 2N times number of passes)Blocked I/O for External Merge SortnDo I/O a page at a timeqNot one I/O per recordnIn fact,read a block(chunk)of pages sequentially!nSuggests we should make each buffer(input/output)be a block of pages.qBut this will reduce fan-in during merge passes!qIn practice,most files still sorted in 2-3 passes.Theme:Amortize a random I/O across more data read.But pay for it in memory footprint Double BufferingnGoal:reduce wait time for I/O requests during mergenIdea:2 blocks RAM per run,disk reader fills one while sort merges the otherqPotentially,more passes;in practice,most files still sorted in 2-3 passes.OUTPUTOUTPUTDiskDiskINPUT 1INPUT kINPUT 2INPUT 1INPUT 2INPUT kblock sizebB main memory buffers,k-way mergeTheme:overlap I/O and CPU activity via read-ahead(prefetching)Using B+Trees for SortingnScenario:Table to be sorted has B+tree index on sorting column(s).nIdea:Can retrieve records in order by traversing leaf pages.nIs this a good idea?nCases to consider:qB+tree is clusteredqB+tree is not clusteredGood idea!Could be a very bad idea!Clustered B+Tree Used for SortingnCost:root to the left-most leaf,then retrieve all leaf pages(Alternative 1)nIf Alternative 2 is used?Additional cost of retrieving data records:each page fetched just once.*Always better than external sorting!(Directs search)Data RecordsIndexData Entries(Sequence set)Unclustered B+Tree Used for SortingnAlternative(2)for data entries;each data entry contains rid of a data record.In general,one I/O per data record!(Directs search)Data RecordsIndexData Entries(Sequence set)Cost of External Sorting vs.Unclustered Index*p:number of records per page*B=1,000 and block size=32 for sorting*p=100 is the more realistic value.N:number of pages of recordsSummarynExternal sorting is importantnExternal merge sort minimizes disk I/O cost:qPass 0:Produces sorted runs of size B(#buffer pages).Later passes:merge runs.q#of runs merged at a time depends on B,and block size.qLarger block size means less I/O cost per page.qLarger block size means smaller#runs merged.qIn practice,#of runs rarely more than 2 or 3.Summary,cont.nChoice of internal sort algorithm may matter:qQuicksort:Quick!qHeap/tournament sortnThe best sorts are wildly fast:qDespite 40+years of research,still improving!nClustered B+tree is good for sorting;unclustered tree is usually very bad.
收藏