Lecture on Data Warehouse and Data Mining Technology

Overview of Data Warehousing and Data Mining
Concepts, Architecture, Trends, Applications
Presenter: Zhu Jianqiu (朱建秋)
June 7, 2001

Outline
- Data warehouse concepts
- Data warehouse architecture and components
- Data warehouse design
- Data warehouse technology (differences from database technology)
- Data warehouse performance
- Data warehouse applications
- Overview of data mining applications
- Data mining techniques and trends
- Data mining application platform (project proposed to the Science and Technology Commission)

Data warehouse concepts
- Basic concepts
- Some misconceptions about data warehouses

Basic concepts: data warehouse
- A data warehouse is a subject-oriented, integrated, non-volatile, and time-variant collection of data in support of management's decisions [Inmon, 1996].
- A data warehouse is a set of methods, techniques, and tools that may be leveraged together to produce a vehicle that delivers data to end-users on an integrated platform [Ladley, 1997].
- A data warehouse is a process of creating, maintaining, and using a decision-support infrastructure [Appleton, 1995] [Haley, 1997] [Gardner, 1998].

Basic concepts: data warehouse characteristics [Inmon, 1996]
- Subject-oriented
  - The tables of one subject area are drawn from several operational applications (for example, the customer subject draws on order processing, accounts receivable, accounts payable, ...)
  - Typical subject areas: customer; product; transaction; account
  - A subject area is implemented as a set of related tables
  - Related tables are linked through a common key (for example, Customer ID)
  - Every key carries a time element (from date to date; monthly cumulative; single date; ...)
  - Data within a subject may be stored on different media (summary level, detail level, multiple granularities)
- Integrated
  - Data extraction, cleansing, transformation, and loading
- Non-volatile
  - Data is added in batches; data already in the warehouse does not change
- Time-variant (time dimension)
- Supports management decision-making

Basic concepts: Data Mart, ODS
- Data mart: a small data warehouse serving a department or workgroup.
- Operational Data Store (ODS): a collection of data that supports the enterprise's day-to-day, enterprise-wide applications. It is a new data environment distinct from the DB, a hybrid form obtained by extending the DW.
  Four basic characteristics: subject-oriented, integrated, volatile, and
  current or near-current.

Basic concepts: ETL, metadata, granularity, partitioning
- ETL (Extract/Transformation/Load): tools for extracting, transforming, and loading data, e.g. Microsoft DTS, IBM Visual Warehouse, etc.
- Metadata: data about data, used to build, maintain, manage, and use the data warehouse. It is especially important in a data warehouse.
- Granularity: the level of detail or summarization of the data held in the warehouse's units of data. The more detailed the data, the finer the granularity.
- Partitioning: data is spread across separate physical units that can be processed independently.

Some misconceptions about data warehouses
- Data warehouse and OLAP
- Star data model
- Multidimensional analysis
- A data warehouse is not a virtual concept
- Data warehouse and normalization theory
- Denormalization is required

Data warehouse architecture and components
- Architecture
- ETL tools
- Metadata repository and metadata management
- Data access and analysis tools

Architecture [Pieter, 1998]
Figure labels: Source Databases; Data Extraction, Transformation, Load; Warehouse Admin. Tools; Extract, Transform and Load; Data Modeling Tool; Central Metadata; Architected Data Marts; Data Access and Analysis; End-User DW Tools; Central Data Warehouse; Mid-Tier; Data Mart; Local Metadata; Metadata Exchange; MDB; Data Cleansing Tool; Relational; Appl.
Package; Legacy; External; RDBMS.

Architecture with an ODS
Figure labels: Source Databases; Hub - Data Extraction, Transformation, Load; Warehouse Admin. Tools; Extract, Transform and Load; Data Modeling Tool; Central Metadata; Architected Data Marts; Data Access and Analysis; Central Data Warehouse and ODS; Mid-Tier; RDBMS; Data Mart; Local Metadata; Metadata Exchange; ODS; OLTP Tools; Data Cleansing Tool; Relational; Appl. Package; Legacy; External; MDB; End-User DW Tools.

The real-world environment: heterogeneity [Douglas Hackney, 2001]
Figure labels: Custom Marketing Data Warehouse; Packaged Oracle Financial Data Warehouse; Packaged I2 Supply Chain Non-Architected Data Mart; Subset Data Marts; Oracle Financials; i2 Supply Chain; Siebel CRM; 3rd Party; e-Commerce.

Federated data warehouse / data mart architecture
Figure labels: Real Time ODS; Federated Financial Data Warehouse; Subset Data Marts; Common Staging Area; Oracle Financials; i2 Supply Chain; Siebel CRM; 3rd Party; Federated Packaged I2 Supply Chain Data Marts; Analytical Applications; e-Commerce; Real Time Data Mining and Analytics; Real Time Segmentation, Classification, Qualification, Offerings, etc.; Federated Marketing Data Warehouse.

Figure labels: ETL tools & DW templates; Data profiling & reengineering tools; Demand-driven data acquisition & analysis; Metadata Interchange; Federated data warehouse and data mart systems; Decision engine models, rules and metrics; OLAP & data mining tools; Analysis templates; Analytic application development tools & components; Analytic applications; Front- and back-office OLTP; e-Business systems; External information providers; CRM Analytics & Reporting; Supply Chain Analytics & Reporting; EKP - Enterprise Knowledge Management Portal; EPM Analytics & Reporting; Business information & recommendations; Informed decisions & actions; Financial Analytics & Reporting; HR Analytics &
Reporting.
(Slide title for the labels above: Closed-loop federated BI architecture)

Key issues for the data warehouse: acquiring, storing, and using data
Figure labels: Relational; Package; Legacy; External source; Data Clean Tool; Data Staging; Enterprise Data Warehouse; Datamart; RDBMS; ROLAP; MDB; End-User Tool.
- The loading capacity of the warehouse and the marts is critical
- The query output capacity of the warehouse and the marts is critical

ETL tools
- Remove unneeded data from the operational databases
- Convert data names and definitions to a common form
- Compute summary data and derived data
- Supply default values for missing data
- Accommodate changes in source data definitions

ETL tool architecture
(Figure; no text recoverable)

Metadata repository and metadata management
Metadata categories: technical metadata; business metadata; data warehouse operational information [Alex Berson et al., 1999]
- Technical metadata: information about warehouse data used by warehouse designers and administrators to carry out development and management tasks. It includes:
  - Information about data sources
  - Transformation descriptions (the mapping from operational databases to the warehouse, and the algorithms used to transform the data)
  - Warehouse object and data structure definitions for the target data
  - Rules for data cleansing and data enhancement
  - Data mapping operations
  - Access authorization, backup history, archive history, information delivery history, data acquisition history, data access, and so on

Metadata repository and metadata management
- Business metadata: information that users can readily understand, including:
  - Subject areas and information object types, including queries, reports, images, audio, video, etc.
  - Internet home pages
  - Other information that supports the warehouse; for an information delivery system, for example, this covers subscription information, scheduling information, details of delivery destinations, business query objects, etc.
- Data warehouse operational information, for example:
  - Data history (snapshots, versions)
  - Ownership
  - Audit trail of extracts
  - Data usage

Metadata repository and metadata management
Metadata repository and tools [Martin
Stardt, 2000]

Data access and analysis tools
- Reporting
- OLAP
- Data mining

Data warehouse design
- Top-down
- Bottom-up
- Hybrid approach
- Data warehouse modeling

Top-down Approach
- Build the enterprise data warehouse
  - Common central data model
  - Data re-engineering performed once
  - Minimize redundancy and inconsistency
  - Detailed and history data; global data discovery
- Build data marts from the Enterprise Data Warehouse (EDW)
  - Subset of EDW relevant to department
  - Mostly summarized data
  - Direct dependency on EDW data availability
Figure labels: Local Data Mart; External Data; Operational Data; Enterprise Warehouse.

Bottom-up design approach
- Build departmental data marts
  - Scope limited to one subject area
  - Quick ROI: local business needs are met
  - Departmental autonomy: flexibility in design
  - A good guide for other departments' data marts
  - Easy to replicate in other departments
  - Data re-engineering is needed for every department
  - Some level of redundancy and inconsistency
  - A practical, workable approach
- Expand to an enterprise data warehouse
  - Build the EDW as a long-term goal
Figure labels: local data mart; external data; operational data (all); operational data (partial); enterprise data warehouse (EDW).

Data warehouse modeling: star schema
Example of Star Schema
- Date dimension: Date, Month, Year
- Cust dimension: CustId, CustName, CustCity, CustCountry
- Sales Fact
  Table: keys Date, Product, Store, Customer; measurements unit_sales, dollar_sales, Yen_sales
- Product dimension: ProductNo, ProdName, ProdDesc, Category, QOH
- Store dimension: StoreID, City, State, Country, Region

Data warehouse modeling: snowflake schema
Example of Snowflake Schema
- Sales Fact Table: keys Date, Product, Store, Customer; measurements unit_sales, dollar_sales, Yen_sales
- Date: Date, Month
- Month: Month, Year
- Year: Year
- Cust: CustId, CustName, CustCity, CustCountry
- Product: ProductNo, ProdName, ProdDesc, Category, QOH
- Store: StoreID, City
- City: City, State
- State: State, Country
- Country: Country, Region

Operational (OLTP) data source: the sales database
(Screenshot slides; surviving captions: star schema; time dimension; fact table; multidimensional model; facts; measures (metrics); time dimension; attributes of the time dimension)

Data warehouse technology [Inmon, 1996]
- Managing large amounts of data
  - The ability to manage large volumes of data
  - The ability to manage them well
- Managing multiple media (a storage hierarchy)
  - Main memory, expanded memory, cache, DASD, optical disk, microfiche
- Monitoring data
  - Decide whether the data should be reorganized
  - Decide whether indexes are poorly structured
  - Decide whether too much data is in overflow
  - Determine the remaining available space
- Receiving and passing data through a variety of technologies
  - Batch mode; online mode is not especially useful
- Programmer/designer control over where data is placed (block/page)
- Parallel storage and management of data
- Metadata management

Data warehouse technology [Inmon, 1996]
- Language interface to the data warehouse
  - Able to access a set of data at a time
  - Able to access one record at a time
  - Supports one or more indexes
  - Has an SQL interface
- Efficient loading of data
- Efficient use of indexes
  - Bitmap techniques, multilevel indexes, etc.
- Data compression
  - I/O resources are far scarcer than CPU resources, so decompressing data is not a major concern
- Compound keys (because data varies over time)
- Variable-length data
- Lock management (the programmer can explicitly control the lock manager)
- Index-only processing (some requests can be serviced by looking at the index alone)
- Fast recovery

Data warehouse technology
[Inmon, 1996]
- Other technological features
  - Traditional technologies play only a small role
  - Transaction integrity, high-speed buffering, row/page-level locking, referential integrity, data views
- Differences between a traditional DBMS and a data warehouse DBMS
  - Designed and optimized for data warehousing and decision support
  - Manages more data: 10 GB / 100 GB / TB
  - A traditional DBMS suits record-level update and provides lock, commit, checkpoint, log, deadlock handling, and rollback
  - Basic data management, such as block management: a traditional DBMS needs to reserve free space
  - Indexes: a traditional DBMS limits the number of indexes, a data warehouse DBMS does not
  - A general-purpose DBMS is physically optimized for transaction access, whereas a data warehouse is optimized for DSS access and analysis
- Changing DBMS technology
- Multidimensional DBMS and the data warehouse
  - The idea that a multidimensional DBMS should be the database technology of the data warehouse is incorrect
  - A multidimensional DBMS (OLAP) is a technology; a data warehouse is an architectural foundation
- Dual levels of granularity (DASD/tape)

Data warehouse technology [Inmon, 1996]
- Metadata in the data warehouse environment
  - DSS analysts are not IT professionals and need the help of metadata
  - The mapping between the operational environment and the data warehouse environment requires metadata
  - A data warehouse holds data spanning a long period, so metadata must record the data structures and definitions
- Context and content (the context dimension)
  - Simple contextual information (data structures / encoding / naming conventions / metrics)
  - Complex contextual information (product definitions / marketing territories / pricing / packaging / organizational structure)
  - External contextual information (economic forecasts: inflation, finance, taxation / political information / competitive information / technological advances)
- Refreshing the data warehouse
  - Data replication (triggers)
  - Change data capture (CDC) (logs)

Data warehouse performance [Inmon, 1999]
- Usage
- Data
- Platform
- Service management
Source: Chinese edition translated by Wang Tianyou (王天佑) et al., "Data Warehouse Management" (《数据仓库管理》),
Publishing House of Electronics Industry, May 2000.

Data warehouse applications: survey of DW user counts
"DW systems with 100-500 users or more will be the dominant group for some time to come."
(Chart: survey of DW users; past year; Meta Group Survey; respondents: 3000+ users or prospective users)

Survey of DW data volumes
(Chart: survey of DW size; past year; Meta Group Survey; respondents: 3000+ users or prospective users)

How Much?
- $3-6m for a mid-size company, less if smaller, more if larger
- $10m+ for large organizations, large data sets
- 10-50+% annual maintenance costs
- 33% hardware / 33% software / 33% services

How Long?
- 2-4 years for 80/20 of the full system for a mid-size company
- 6-12 months for the initial iteration
- 3-6 months for subsequent iterations

How Risky?
- For EDW projects, 20% (Meta) to 70% (OTR, DWN) fail
- High failure rate for non-business-driven initiatives
- Very few systems meet the expectations of the business
- Failure is not due to technology but to "soft" issues
- Massive upside to successful projects (100% - 2000+% ROI)
- 99% politics - 1% technology

Survey of data mining applications
- Overview of data mining applications
- Data mining techniques and trends
- Data mining application platform

Overview of data mining applications
- Share of applications
- Data Mining Upsides
- Data Mining Downsides
- Data Mining Use
- Data Mining Industry and Application
- Data Mining Costs

Share of applications
(Chart; no text recoverable)

- Discovery of previously unknown relationships, trends, anomalies, etc.
- Powerful
  competitive weapon
- Automation of repetitive analysis
- Predictive capabilities
(Slide title for the list above: Data Mining Upsides)

Data Mining Downsides
- Knowledge discovery technology immature
- Long learning and tuning cycles for some technologies
- "Black box" technology minimizes confidence
- VLDB (Very Large Data Base) requirements

Data Mining Uses
- Discover anomalies, outliers and exceptions in process data
- Discover behavior and predict outcomes of customer relationships
  - Churn management
  - Target marketing (market of one)
  - Promotion management
- Fraud detection
- Pattern ID & matching (dark programs, science)

Data Mining Industry and Applications
- From research prototypes to data mining products, languages, and standards
  - IBM Intelligent Miner, SAS Enterprise Miner, SGI MineSet, Clementine, MS/SQLServer 2000, DBMiner, BlueMartini, MineIt, DigiMine, etc.
  - A few data mining languages and standards (esp. MS OLEDB for Data Mining)
- Application achievements in many domains
  - Market analysis, trend analysis, fraud detection, outlier analysis, Web mining, etc.

Data Mining Costs
- Desktop tools: $500 and up (MSFT coming at low price point)
- Server / MF based: $20,000 to $700,000+
- Must also add cost of extensive consulting for high-end tools
- Don't forget long training and learning curve time
- Ongoing process, not task automation software

Data mining trends
- Historical review
- Confluence of multiple disciplines
- Classifying data mining from multiple viewpoints
- Research progress in the last decade
- Trends in data mining
- Data mining and the standardization process

Historical review
- 1989 IJCAI Workshop on Knowledge Discovery in Databases
  - Knowledge Discovery in Databases (G. Piatetsky-Shapiro and W.
    Frawley, 1991)
- 1991-1994 Workshops on Knowledge Discovery in Databases
  - Advances in Knowledge Discovery and Data Mining (U. Fayyad, G. Piatetsky-Shapiro, P. Smyth, and R. Uthurusamy, 1996)
- 1995-1998 International Conferences on Knowledge Discovery in Databases and Data Mining (KDD'95-98)
  - Journal of Data Mining and Knowledge Discovery (1997)
- 1998 ACM SIGKDD, SIGKDD'1999-2001 conferences, and SIGKDD Explorations
- More conferences on data mining
  - PAKDD, PKDD, SIAM-Data Mining, (IEEE) ICDM, DaWaK, SPIE-DM, etc.

Data Mining: Confluence of Multiple Disciplines
Figure labels: Data Mining; Database Technology; Statistics; Other Disciplines; Information Science; Machine Learning (AI); Visualization.

A Multi-Dimensional View of Data Mining
(Slide title only; no body text recoverable)

Research Progress in the Last Decade
- Multi-dimensional data analysis: data warehouse and OLAP (on-line analytical processing)
- Association, correlation, and causality analysis
- Classification: scalability and new approaches
- Clustering and outlier analysis
- Sequential patterns and time-series analysis
- Similarity analysis: curves, trends, images, texts, etc.
- Text mining, Web mining and Weblog analysis
- Spatial, multimedia, scientific data analysis
- Data preprocessing and database compression
- Data visualization and visual data mining
- Many others, e.g., collaborative filtering

Research Directions [Han J. W.
, 2001]
- Web mining
- Towards integrated data mining environments and tools
- "Vertical" (or application-specific) data mining
- Invisible data mining
- Towards intelligent, efficient, and scalable data mining methods

Towards Integrated Data Mining Environments and Tools
- OLAP Mining: Integration of Data Warehousing and Data Mining
- Querying and Mining: An Integrated Information Analysis Environment
- Basic Mining Operations and Mining Query Optimization
- "Vertical" (or application-specific) data mining
- Invisible data mining

Querying and Mining: An Integrated Information Analysis Environment
- Data mining as a component of DBMS, data warehouse, or Web information system
  - Integrated information processing environment
    - MS/SQLServer-2000 (Analysis service)
    - IBM IntelligentMiner on DB2
    - SAS EnterpriseMiner: data warehousing + mining
- Query-based mining
  - Querying database/DW/Web knowledge
  - Efficiency and flexibility: preprocessing, on-line processing, optimization, integration, etc.

"Vertical" Data Mining
- Generic data mining tools?
  Too simple to match domain-specific, sophisticated applications
  - Expert knowledge and business logic represent many years of work in their own fields!
  - Data mining + business logic + domain experts
- A multi-dimensional view of data miners
  - Complexity of data: Web, sequence, spatial, multimedia, ...
  - Complexity of domains: DNA, astronomy, market, telecom, ...
- Domain-specific data mining tools
  - Provide concrete, killer solutions to specific problems
  - Feedback to build more powerful tools

Invisible Data Mining
- Build mining functions into daily information services
  - Web search engine (link analysis, authoritative pages, user profiles), adaptive web sites, etc.
  - Improvement of query processing: history + data
  - Making service smart and efficient
- Benefits from/to data mining research
  - Data mining research has produced many scalable, efficient, novel mining solutions
  - Applications feed new challenge problems to research

Towards Intelligent Tools for Data Mining
- Integration paves the way to intelligent mining
- Smart interface brings intelligence
  - Easy to use, understand and manipulate
  - One picture may be worth 1,000 words
  - Visual and audio data mining
- Human-Centered Data Mining
- Towards self-tuning, self-managing, self-triggering data mining

Integrated Mining: A Booster for Intelligent Mining
- Integration paves the way to intelligent mining
- Data mining integrates with DBMS, DW, WebDB, etc.
  - Integration inherits the power of up-to-date information technology: querying, MD analysis, similarity search, etc.
  - Mining can be viewed as querying database knowledge
  - Integration leads to standard interface/language, function/process standardization, utility, and reachability
- Efficiency and scalability bring intelligent mining to reality

Data mining and the standardization process
- CRISP-DM (CRoss-Industry Standard Process for Data Mining): process standardization
- XML: combined with data preprocessing
- SOAP (Simple Object Access Protocol): a standard for interoperation between databases and systems
- PMML: a standard for exchanging predictive models
- OLE DB For Data
  Mining: an API-based interface for data mining systems

Data mining application platform
- Final project goals
- Research content (including system structure, layers, etc.)
- Technical route and implementation approach
- Analysis of key technologies
- Deliverables and evaluation criteria

Final project goals (1)
Within one year: study data mining techniques, implement the main data mining algorithms, develop a data mining application platform that carries independent intellectual property and is extensible and easy to apply, and establish a standardized, practical methodology for applying data mining.

Final project goals (2)
- The data mining techniques studied reach an internationally advanced level
- The main data mining algorithms are implemented, such as association rules, clustering, and classification
- The data mining application platform carries independent intellectual property and is extensible and easy to apply
- The methodology for applying data mining is standardized and practical

Research content
- Layered structure
  - Data mining + business logic + industry applications
- Extensible architecture
- Software structure
- Applications of the data mining platform
  - Industry-specific solutions
  - Secondary development by software vendors

Layered structure: data mining + business logic + industry applications
- Data mining algorithm layer: association rules, sequential patterns, classification, clustering, neural networks, deviation analysis, ...
- Business logic layer: product recommendation, customer segmentation, customer churn, fraud detection, feature analysis, ...
- Industry application layer: gene (DNA) analysis, banking, insurance, telecom, securities, retail, ...
(Figure label: data mining application platform)

Extensible architecture
Figure labels: MIS; ERP; CRM; E_Business; data mining application platform; exploration data warehouse; data mining algorithm library; model library; component library; product recommendation; customer segmentation; customer churn; fraud detection; feature analysis; sequence analysis; ...
...; industry applications; knowledge; data mining application server; information systems; industry clients.

Software structure
- A data warehouse for use by data mining
- ETL tools
- Data mining application server
- Management platform for the data mining application server
- Industry-specific analysis platform

Applications of the data mining platform: industry-specific solutions
Figure labels: information system; data source; industry-specific data mining application; model use.

Applications of the data mining platform: secondary development by software vendors
Figure labels: information system; data source; software products: MIS, ERP, CRM, ...; model use; the original software product; added data mining decision-support module.

Technical route and implementation approach
- Phase 1: data mining application server
  1. Track and understand current research
  2. Study business models
  3. Data warehouse modeling
  4. Implement data mining algorithms
  5. Build the server framework
- Phase 2: application server management platform
  1. Visual model creation
  2. Server scheduling and listening
  3. Develop the data extraction tool
  4. Friendly user interface
- Phase 3: industry applications
  1. Visual model display
  2. Application of model components
  3. Applications in specific industries
  4. Secondary development with components
  5. Friendly human-computer interface

Analysis of key technologies
- Implementing business models in the data warehouse
- Visualization of business models
- Embedding models smoothly into other applications (ERP, CRM)
- Developing the ETL (extract, transform, load) tool
- The mapping between mining algorithms and business models
- Optimizing data mining algorithms

Any Questions?
Zhujianqiu@
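
Supplementary sketch for the star schema and granularity slides: the example below builds a small star schema with Python's built-in sqlite3 module and rolls daily sales up to month level. It is only an illustration under stated assumptions, not the presenter's implementation. The table names (date_dim, product_dim, store_dim, sales_fact), the helper functions, and every sample row are hypothetical; only the column ideas (Date/Month/Year, ProductNo, StoreID, unit_sales, dollar_sales) are taken from the slides.

```python
import sqlite3


def build_star_schema(conn):
    """Create three dimension tables and one fact table (hypothetical names)."""
    conn.executescript(
        """
        CREATE TABLE date_dim (
            date_key TEXT PRIMARY KEY,   -- e.g. '2001-06-07'
            month    TEXT,               -- e.g. '2001-06'
            year     INTEGER
        );
        CREATE TABLE product_dim (
            product_no INTEGER PRIMARY KEY,
            prod_name  TEXT,
            category   TEXT
        );
        CREATE TABLE store_dim (
            store_id INTEGER PRIMARY KEY,
            city     TEXT,
            country  TEXT
        );
        CREATE TABLE sales_fact (
            date_key     TEXT    REFERENCES date_dim(date_key),
            product_no   INTEGER REFERENCES product_dim(product_no),
            store_id     INTEGER REFERENCES store_dim(store_id),
            unit_sales   INTEGER,
            dollar_sales REAL
        );
        """
    )


def load_sample(conn):
    """Batch-load made-up rows; existing warehouse rows are never updated."""
    conn.executemany("INSERT INTO date_dim VALUES (?, ?, ?)", [
        ("2001-05-30", "2001-05", 2001),
        ("2001-06-01", "2001-06", 2001),
        ("2001-06-07", "2001-06", 2001),
    ])
    conn.executemany("INSERT INTO product_dim VALUES (?, ?, ?)", [
        (1, "Widget", "Hardware"),
        (2, "Gadget", "Hardware"),
    ])
    conn.executemany("INSERT INTO store_dim VALUES (?, ?, ?)", [
        (1, "Shanghai", "China"),
        (2, "Beijing", "China"),
    ])
    conn.executemany("INSERT INTO sales_fact VALUES (?, ?, ?, ?, ?)", [
        ("2001-05-30", 1, 1, 2, 20.0),
        ("2001-06-01", 1, 1, 3, 30.0),
        ("2001-06-07", 2, 2, 1, 15.5),
    ])
    conn.commit()


def sales_by_month(conn):
    """Roll daily facts up to month level, i.e. a coarser granularity."""
    return conn.execute(
        """
        SELECT d.year, d.month, SUM(f.unit_sales), SUM(f.dollar_sales)
        FROM sales_fact AS f
        JOIN date_dim AS d ON f.date_key = d.date_key
        GROUP BY d.year, d.month
        ORDER BY d.year, d.month
        """
    ).fetchall()


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    build_star_schema(conn)
    load_sample(conn)
    for row in sales_by_month(conn):
        print(row)
```

The fact table holds only keys and measurements, while the descriptive attributes live in the dimension tables, so one join per dimension is enough to summarize along it. A snowflake variant would further split date_dim into separate month and year tables.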