Data Warehousing and Data Mining, Part 2



Database System Concepts, 6th Edition (©Silberschatz, Korth and Sudarshan)

Chapter 20: Data Analysis
- Decision Support Systems
- Data Warehousing
- Data Mining
- Classification
- Association Rules
- Clustering

Decision Support Systems
- Decision-support systems are used to make business decisions, often based on data collected by on-line transaction-processing systems.
- Examples of business decisions:
  - What items to stock?
  - What insurance premium to change?
  - To whom to send advertisements?
- Examples of data used for making decisions:
  - Retail sales transaction details
  - Customer profiles (income, age, gender, etc.)

Decision-Support Systems: Overview
- Data analysis tasks are simplified by specialized tools and SQL extensions.
  - Example tasks:
    - For each product category and each region, what were the total sales in the last quarter, and how do they compare with the same quarter last year?
    - As above, for each product category and each customer category.
- Statistical analysis packages (e.g., S++) can be interfaced with databases.
  - Statistical analysis is a large field, but it is not covered here.
- Data mining seeks to discover knowledge automatically in the form of statistical rules and patterns from large databases.
- A data warehouse archives information gathered from multiple sources, and stores it under a unified schema, at a single site.
  - Important for large businesses that generate data from multiple divisions, possibly at multiple sites.
  - Data may also be purchased externally.

Data Warehousing
- Data sources often store only current data, not historical data.
- Corporate decision making requires a unified view of all organizational data, including historical data.
- A data warehouse is a repository (archive) of information gathered from multiple sources, stored under a unified schema, at a single site.
  - Greatly simplifies querying and permits the study of historical trends.
  - Shifts decision support query load away from transaction processing systems.

[Figure: Data Warehousing]

Design Issues
- When and how to gather data
  - Source driven architecture: data sources transmit new information to the warehouse, either continuously or periodically (e.g., at night).
  - Destination driven architecture: the warehouse periodically requests new information from data sources.
  - Keeping the warehouse exactly synchronized with data sources (e.g., using two-phase commit) is too expensive.
    - It is usually OK to have slightly out-of-date data at the warehouse.
    - Data/updates are periodically downloaded from online transaction processing (OLTP) systems.
- What schema to use
  - Schema integration

More Warehouse Design Issues
- Data cleansing
  - E.g., correct mistakes in addresses (misspellings, zip code errors).
  - Merge address lists from different sources and purge duplicates.
- How to propagate updates
  - The warehouse schema may be a (materialized) view of the schema from data sources.
- What data to summarize
  - Raw data may be too large to store on-line.
  - Aggregate values (totals/subtotals) often suffice.
  - Queries on raw data can often be transformed by the query optimizer to use aggregate values.

Warehouse Schemas
- Dimension values are usually encoded using small integers and mapped to full values via dimension tables.
- The resultant schema is called a star schema.
  - More complicated schema structures:
    - Snowflake schema: multiple levels of dimension tables
    - Constellation: multiple fact tables

[Figure: Data Warehouse Schema]

Data Mining
- Data mining is the process of semi-automatically analyzing large databases to find useful patterns.
- Prediction based on past history
  - Predict if a credit card applicant poses a good credit risk, based on some attributes (income, job type, age, ...) and past history.
  - Predict if a pattern of phone calling card usage is likely to be fraudulent.
- Some examples of prediction mechanisms:
  - Classification: given a new item whose class is unknown, predict to which class it belongs.
  - Regression formulae: given a set of mappings for an unknown function, predict the function result for a new parameter value.

Data Mining (Cont.)
- Descriptive patterns
  - Associations
    - Find books that are often bought by "similar" customers. If a new such customer buys one such book, suggest the others too.
  - Associations may be used as a first step in detecting causation.
    - E.g., association between exposure to chemical X and cancer.
  - Clusters
    - E.g., typhoid cases were clustered in an area surrounding a contaminated well.
    - Detection of clusters remains important in detecting epidemics.

Classification Rules
- Classification rules help assign new objects to classes.
  - E.g., given a new automobile insurance applicant, should he or she be classified as low risk, medium risk or high risk?
- Classification rules for the above example could use a variety of data, such as educational level, salary, age, etc.
  - ∀ person P, P.degree = masters and P.income > 75,000 ⇒ P.credit = excellent
  - ∀ person P, P.degree = bachelors and (P.income ≥ 25,000 and P.income ≤ 75,000) ⇒ P.credit = good
- Rules are not necessarily exact: there may be some misclassifications.
- Classification rules can be shown compactly as a decision tree.
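
As a minimal sketch, the two example rules above can be written as an ordinary function. The name `classify_credit`, the lowercase string encoding of degrees, and the `None` result for applicants that match neither rule are assumptions made for illustration; the slides do not specify them.

```python
from typing import Optional


def classify_credit(degree: str, income: float) -> Optional[str]:
    """Apply the two example classification rules from the slide.

    Returns None when neither rule fires (an assumption: the slide
    does not say what happens to applicants outside both rules).
    """
    # Rule 1: masters degree and income above 75,000 => excellent credit.
    if degree == "masters" and income > 75_000:
        return "excellent"
    # Rule 2: bachelors degree and income between 25,000 and 75,000
    # (inclusive on both ends) => good credit.
    if degree == "bachelors" and 25_000 <= income <= 75_000:
        return "good"
    return None


# Hypothetical applicants, invented for this example.
applicants = [("masters", 90_000), ("bachelors", 50_000), ("bachelors", 10_000)]
for degree, income in applicants:
    print(degree, income, classify_credit(degree, income))
```

Each rule is one branch on attribute values, which is the same shape as a path from the root of a decision tree to a leaf.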

[Figure: Decision Tree]

Construction of Decision Trees
- Training set: a data sample in which the classification is already known.
- Greedy top-down generation of decision trees.
  - Each internal node of the tree partitions the data into groups based on a partitioning attribute, and a partitioning condition for the node.
  - Leaf node:
    - all (or most) of the items at the node belong to the same class, or
    - all attributes have been considered, and no further partitioning is possible.

Best Splits
- Pick the best attributes and conditions on which to partition.
- The purity of a set $S$ of training instances can be measured quantitatively in several ways.
  - Notation: number of classes = $k$, number of instances = $|S|$, fraction of instances in class $i$ = $p_i$.
- The Gini measure of purity is defined as

  $$\mathrm{Gini}(S) = 1 - \sum_{i=1}^{k} p_i^2$$

  - When all instances are in a single class, the Gini value is 0.
  - It reaches its maximum (of $1 - 1/k$) if each class has the same number of instances.

Best Splits (Cont.)
- Another measure of purity is the entropy measure, which is defined as

  $$\mathrm{entropy}(S) = -\sum_{i=1}^{k} p_i \log_2 p_i$$

- When a set $S$ is split into multiple sets $S_i$, $i = 1, 2, \ldots, r$, we can measure the purity of the resultant set of sets as:

  $$\mathrm{purity}(S_1, S_2, \ldots, S_r) = \sum_{i=1}^{r} \frac{|S_i|}{|S|}\, \mathrm{purity}(S_i)$$

- The information gain due to a particular split of $S$ into $S_i$, $i = 1, 2, \ldots, r$:

  $$\text{Information-gain}(S, \{S_1, S_2, \ldots, S_r\}) = \mathrm{purity}(S) - \mathrm{purity}(S_1, S_2, \ldots, S_r)$$

Best Splits (Cont.)
- Measure of the "cost" of a split:

  $$\text{Information-content}(S, \{S_1, S_2, \ldots, S_r\}) = -\sum_{i=1}^{r} \frac{|S_i|}{|S|} \log_2 \frac{|S_i|}{|S|}$$

- Information-gain ratio:

  $$\text{Information-gain ratio} = \frac{\text{Information-gain}(S, \{S_1, S_2, \ldots, S_r\})}{\text{Information-content}(S, \{S_1, S_2, \ldots, S_r\})}$$

- The best split is the one that gives the maximum information gain ratio.

Finding Best Splits
- Categorical attributes (with no meaningful order):
  - Multi-way split: one child for each value.
  - Binary split: try all possible breakups of the values into two sets, and pick the best.
- Continuous-valued attributes (can be sorted in a meaningful order):
  - Binary split:
    - Sort the values and try each as a split point.
      - E.g., if the values are 1, 10, 15, 25, split at ≤ 1, ≤ 10, ≤ 15.
    - Pick the value that gives the best split.
  - Multi-way split:
    - A series of binary splits on the same attribute has a roughly equivalent effect.

Decision-Tree Construction Algorithm

    Procedure GrowTree(S)
        Partition(S);

    Procedure Partition(S)
        if (purity(S) > δ_p or |S| < δ_s) then
            return;
        for each attribute A
            evaluate splits on attribute A;
        Use best split found (across all attributes) to partition S into S_1, S_2, ..., S_r;
        for i = 1, 2, ..., r
            Partition(S_i);

Other Types of Classifiers
- Neural net classifiers are studied in artificial intelligence and are not covered here.
- Bayesian classifiers use Bayes theorem, which says

  $$p(c_j \mid d) = \frac{p(d \mid c_j)\, p(c_j)}{p(d)}$$

  where
  - $p(c_j \mid d)$ = probability of instance $d$ being in class $c_j$,
  - $p(d \mid c_j)$ = probability of generating instance $d$ given class $c_j$,
  - $p(c_j)$ = probability of occurrence of class $c_j$, and
  - $p(d)$ = probability of instance $d$ occurring.

Naïve Bayesian Classifiers
- Bayesian classifiers require
  - computation of $p(d \mid c_j)$
  - precomputation of $p(c_j)$
  - $p(d)$ can be ignored since it is the same for all classes.
- To simplify the task, naïve Bayesian classifiers assume attributes have independent distributions, and thereby estimate

  $$p(d \mid c_j) = p(d_1 \mid c_j) \cdot p(d_2 \mid c_j) \cdots p(d_n \mid c_j)$$

  - Each of the $p(d_i \mid c_j)$ can be estimated from a histogram on $d_i$ values for each class $c_j$.
    - The histogram is computed from the training instances.
  - Histograms on multiple attributes are more expensive to compute and store.

Regression
- Regression deals with the prediction of a value, rather than a class.
  - Given values for a set of variables $X_1, X_2, \ldots, X_n$, we wish to predict the value of a variable $Y$.
- One way is to infer coefficients $a_0, a_1, a_2, \ldots, a_n$ such that

  $$Y = a_0 + a_1 X_1 + a_2 X_2 + \cdots + a_n X_n$$

- Finding such a linear polynomial is called linear regression.
  - In general, the process of finding a curve that fits the data is also called curve fitting.
- The fit may only be approximate
  - because of noise in the data, or
  - because the relationship is not exactly a polynomial.
- Regression aims to find coefficients that give the best possible fit.

Association Rules
- Retail shops are often interested in associations between different items that people buy.
  - Someone who buys bread is quite likely also to buy milk.
  - A person who bought the book Database System Concepts is quite likely also to buy the book Operating System Concepts.
- Association information can be used in several ways.
  - E.g., when a customer buys a particular book, an online shop may suggest associated books.
- Association rules:
  - bread ⇒ milk
  - DB-Concepts, OS-Concepts ⇒ Networks
  - Left hand side: antecedent; right hand side: consequent.
  - An association rule must have an associated population; the population consists of a set of instances.
    - E.g., each transaction (sale) at a shop is an instance, and the set of all transactions is the population.

Association Rules (Cont.)
- Rules have an associated support, as well as an associated confidence.
- Support is a measure of what fraction of the population satisfies both the antecedent and the consequent of the rule.
  - E.g., the support for the rule milk ⇒ screwdrivers is low.
- Confidence is a measure of how often the consequent is true when the antecedent is true.
  - E.g., the rule bread ⇒ milk has a confidence of 80 percent if 80 percent of the purchases that include bread also include milk.

Finding Association Rules
- We are generally only interested in association rules with reasonably high support (e.g., support of 2% or greater).
- Naïve algorithm
  1. Consider all possible sets of relevant items.
  2. For each set, find its support (i.e., count how many transactions purchase all items in the set).
     - Large itemsets: sets with sufficiently high support.
  3. Use large itemsets to generate association rules.
     - From itemset A, generate the rule A − {b} ⇒ b for each b ∈ A.
       - Support of rule = support(A).
       - Confidence of rule = support(A) / support(A − {b}).

Finding Support
- Determine the support of itemsets via a single pass on the set of transactions.
  - Large itemsets: sets with a high count at the end of the pass.
- If memory is not enough to hold all counts for all itemsets, use multiple passes, considering only some itemsets in each pass.
- Optimization: once an itemset is eliminated because its count (support) is too small, none of its supersets needs to be considered.
- The a priori technique to find large itemsets:
  - Pass 1: count the support of all sets with just 1 item. Eliminate those items with low support.
  - Pass i: candidates are every set of i items such that all its (i−1)-item subsets are large.
    - Count the support of all candidates.
    - Stop if there are no candidates.

Other Types of Associations
- Basic association rules have several limitations.
- Deviations from the expected probability are more interesting.
  - E.g., if many people purchase bread, and many people purchase cereal, quite a few would be expected to purchase both.
  - We are interested in positive as well as negative correlations between sets of items.
    - Positive correlation: co-occurrence is higher than predicted.
    - Negative correlation: co-occurrence is lower than predicted.
- Sequence associations / correlations
  - E.g., whenever bonds go up, stock prices go down in 2 days.
- Deviations from temporal patterns
  - E.g., deviation from a steady growth.
  - E.g., sales of winter wear go down in summer.
    - Not surprising, part of a known pattern.
    - Look for deviation from the value predicted using past patterns.

Clustering
- Clustering: intuitively, finding clusters of points in the given data such that similar points lie in the same cluster.
- Can be formalized using distance metrics in several ways.
  - Group points into k sets (for a given k) such that the average distance of points from the centroid of their assigned group is minimized.
    - Centroid: point defined by taking the average of coordinates in each dimension.
  - Another metric: minimize the average distance between every pair of points in a cluster.
- Has been studied extensively in statistics, but on small data sets.
  - Data mining systems aim at clustering techniques that can handle very large data sets.
  - E.g., the Birch clustering algorithm (more shortly).

Hierarchical Clustering
- Example from biological classification
  - (the word classification here does not mean a prediction mechanism)
  - chordata
    - mammalia: leopards, humans
    - reptilia: snakes, crocodiles
- Other examples: Internet directory systems (e.g., Yahoo, more on this later).
- Agglomerative clustering algorithms
  - Build small clusters, then cluster small clusters into bigger clusters, and so on.
- Divisive clustering algorithms
  - Start with all items in a single cluster, repeatedly refine (break) clusters into smaller ones.

Clustering Algorithms
- Clustering algorithms have been designed to handle very large datasets.
- E.g., the Birch algorithm
  - Main idea: use an in-memory R-tree to store points that are being clustered.
  - Insert points one at a time into the R-tree, merging a new point with an existing cluster if it is less than some δ distance away.
  - If there are more leaf nodes than fit in memory, merge existing clusters that are close to each other.
  - At the end of the first pass we get a large number of clusters at the leaves of the R-tree.
    - Merge clusters to reduce the number of clusters.

Collaborative Filtering
- Goal: predict what movies/books/… a person may be interested in, on the basis of
  - Past preferences of the person
  - Other people with similar past preferences
  - The preferences of such people for a new movie/book/…
- One approach based on repeated clustering
  - Cluster people on the basis of preferences for movies.
  - Then cluster movies on the basis of being liked by the same clusters of people.
  - Again cluster people based on their preferences for (the newly created clusters of) movies.
  - Repeat the above till equilibrium.
- The above problem is an instance of collaborative filtering, where users collaborate in the task of filtering information to find information of interest.

Other Types of Mining
- Text mining: application of data mining to textual documents
  - cluster Web pages to find related pages
  - cluster pages a user has visited to organize their visit history
  - classify Web pages automatically into a Web directory
- Data visualization systems help users examine large volumes of data and detect patterns visually.
  - Can visually encode large amounts of information on a single screen.
  - Humans are very good at detecting visual patterns.

End of Chapter

[Figures: Figure 20.01, Figure 20.02, Figure 20.03, Figure 20.05]
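
To make the support and confidence definitions and the a priori passes from the association-rule slides concrete, here is a small sketch. The function names (`support`, `apriori`, `rules`), the five sample transactions, and the 0.4 support threshold are all assumptions invented for this example. It recomputes support by rescanning the transactions, which is fine for a toy data set but is not how a system handling large data would count.

```python
from itertools import combinations


def support(itemset, transactions):
    """Fraction of transactions that contain every item in itemset."""
    itemset = frozenset(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)


def apriori(transactions, min_support):
    """Return {itemset: support} for all large itemsets, found pass by pass."""
    items = sorted({item for t in transactions for item in t})
    # Pass 1: keep single items whose support is high enough.
    current = {frozenset([i]) for i in items
               if support([i], transactions) >= min_support}
    large = {}
    size = 1
    while current:
        for s in current:
            large[s] = support(s, transactions)
        size += 1
        # Pass i: a candidate is a set of i items all of whose
        # (i-1)-item subsets were large in the previous pass.
        pool = sorted({i for s in current for i in s})
        candidates = [frozenset(c) for c in combinations(pool, size)
                      if all(frozenset(sub) in current
                             for sub in combinations(c, size - 1))]
        current = {c for c in candidates
                   if support(c, transactions) >= min_support}
    return large


def rules(large):
    """Yield (antecedent, consequent, support, confidence) for A - {b} => b."""
    for a, sup in large.items():
        if len(a) < 2:
            continue
        for b in a:
            antecedent = a - {b}
            # Every subset of a large itemset is itself large,
            # so its support is already in the table.
            yield antecedent, b, sup, sup / large[antecedent]


# Hypothetical purchase data, invented for this example.
transactions = [
    {"bread", "milk"},
    {"bread", "milk", "cereal"},
    {"bread", "cereal"},
    {"milk"},
    {"bread", "milk"},
]

large = apriori(transactions, min_support=0.4)
for antecedent, consequent, sup, conf in rules(large):
    print(sorted(antecedent), "=>", consequent, sup, conf)
```

In this sample data, 3 of the 4 purchases that include bread also include milk, so bread ⇒ milk has confidence 3/4, while both purchases that include cereal also include bread, so cereal ⇒ bread has confidence 1. The itemset {milk, cereal} falls below the threshold, so the three-item set is never counted, which is the superset-pruning optimization described on the Finding Support slide.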