Principles of Data Mining
Dr. Zhao Weidong
School of Software, Fudan University

What is Data Mining?
- According to the Gartner Group: data mining is the process of discovering meaningful new correlations, patterns and trends by sifting through large amounts of data stored in repositories, using pattern recognition technologies as well as statistical and mathematical techniques.
- Data mining refers to the work of discovering new and useful (business) knowledge from large real databases through a non-trivial process, using a sound methodology and multiple data processing and analytical techniques.
- Examples:
  - Detect taxation fraud: not declaring all income for taxation.
  - From the thousands of mobile phone customers, predict which customers are going to switch to a competitor.

Data mining is influenced by multiple disciplines
- Data mining is an interdisciplinary field, influenced by several disciplines including database systems, statistics, machine learning, visualization, and information science.

A more formal definition of data mining
- A high-level, proactive, automatic discovery approach, known as discovery-driven knowledge discovery.
- The process of extracting correct, useful, previously unknown, and comprehensive information from data and using it to make decisions.
- The disciplines related to data mining are statistical theory, database technology, and artificial intelligence.
- Todd Rowe, formerly of Business Objects, once said: "Technically speaking, you can use BI with nothing more than complete Excel data."

Process
- Data mining is not a boxed software tool that can simply be bought and run in a business intelligence environment, nor does it automatically start producing noteworthy business patterns.

Correct
- The extracted information should be correct and statistically significant, so that it supports well-founded decisions. Correctness implies both validity and completeness: we want not only the right customers from the database, but all of the right customers. This requires both the source data and the data mining process to be correct.

Useful
- The data mining process may deliver correct and significant results, but the knowledge must also be useful to the business. If the results tell you to diversify your marketing across a large number of channels, for example, that may not be feasible. The results must also let you act before your competitors do.

Unknown
- Data mining should produce new information. If the process only delivers trivial results, the business motivation for data mining disappears. This is the property that distinguishes exploration from verification.

Minimum requirements
- The points above are the minimum requirements for data mining. They can be used to evaluate whether data mining adds value to the business environment.

Other requirements

Why Data Mining?
- Gain an insight into business data.
- Identify useful patterns, correlations and models from data automatically to answer questions like:
  - Which customer is likely to churn in two months?
  - Which customer is my cross-sell target?
  - What are the characteristics of my high-spending and low-spending customers?
- Data mining is a core technology of business intelligence.
- Data mining is a core application of data warehouses.
- Data mining is the core technology of analytical CRM.
- Data mining is the core technology of online recommendation and personalization in e-commerce.
- Data mining has become a part of the business function in many companies.
- Data mining is regularly used in

Typical data mining system architecture

Verification-Driven Analysis
- Verification-driven data mining tools extract data. The user is expected to generate information based on their interpretation of the returned data.

New Process With Data Mining
- Discovery-driven
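- The computer sifts through millions of hypotheses and presents only the most interesting/valid ones.
- Example: from a sample group of clients who have defected to a competing bank, identify the client characteristics that are strongly correlated, then use these attributes to score the rest of the client and prospect population on the strength of their relationship to the sample group.

What Can Data Mining Do?
- Classification and Estimation
- Prediction
- Forecasting
- Clustering and Segmentation
- Association Discovery
- Description and Visualization
- Market Basket Analysis and Up-Selling/Cross-Selling
- Pharmaceutical Industry: Drug Effectiveness by Patient Type
- Defect Analysis in Manufacturing
- University and Employee Recruitment
- Employee Turnover Predictions
- Credit Risk Determination
- Credit Card Fraud
- Customer Grouping and Behaviour Prediction

Data mining process
- Accounts for 70% of the effort; the most important phase
- Accounts for 25% of the effort
- System demonstration
- Effort Distribution
- CRISP-DM is an iterative, adaptive process.

IBM Intelligent Miner visualization interface

AlphaMiner interface

The data mining process is a cycle
- The figure above can easily give the impression of a linear process.
- In fact, the result of each step may lead to the conclusion that more information is needed from earlier steps, and the process is repeated. These cycles ensure that the final result is fully tailored to the business.

Business analysis
- Ideally, all activities in a company relate, to varying degrees, to the company's mission statement through its strategy and business objectives. Data mining lets you control your objectives at a higher level than before.
- Business analysis involves domain experts and mining experts. The former concentrate on specifying the business requirements; the latter check from a data mining perspective that these requirements are feasible and specify the mining operations needed to satisfy them.

Data analysis
- To study the data with statistical methods, it may be necessary to clean the data, fill in missing values, or integrate data from several systems.
- Data analysis gives a first understanding of the data transformations required in later steps, such as cleaning and integration. It may also show that external information must be acquired, such as customer demographic data that is not needed in day-to-day business operations.
- The roles involved in this step are mining experts, who carry out most of the tasks, and database administrators, who support these activities by providing access to the data.

Data preparation
- Once the data needed for mining is available, some preparation is often required before mining proper can begin. Whether such preparation is needed is largely assessed during the data analysis step.

Data quality
- The data may contain certain values, called outliers, that lie far outside the expected normal range.
- These values can be handled in several ways:
  - If they genuinely occur in the real world, taking the logarithm of the data compresses them into a smaller range.
  - Otherwise, the records containing these values can be removed, or the affected attribute can be removed from all records.

Missing values
- A more common problem is missing values.
- Some records may have missing values, or a particular attribute may have a large number of missing values.
- In the first case, those records can be left out.
- In the second case, the attribute can be discarded.

Guessing missing values
- Another way to handle missing values is imputation. Several techniques can be used to guess a missing value. In order of increasing complexity:
  - Fill in a value drawn at random from another record.
  - Use the mode, median, or mean of the corresponding attribute in the other records.
  - Build a statistical model of the distribution of this attribute's values in the other records, then draw a random value from that distribution.
  - Use statistical or mining techniques to estimate the missing value from the values of similar records.

Data preprocessing

Inconsistencies in the data
- Data mining can deal effectively with inconsistencies in the data. Even when the source data is clean, integrated, and validated, it may still contain values that do not reflect the real world.
- The only way to effectively identify and resolve data quality problems is for the enterprise to monitor, analyze, and report on its internal processes.
- Scott [surname illegible in source], chief information officer of the U.S. hard-drive manufacturer Maxtor, said: "The biggest difficulty in business intelligence is ensuring that the lowest-level data feeding summary analyses and dashboards is always clean, consistent, and relevant. We need data warehouses to be self-healing, able to automatically sense, detect, report, and repair any incorrect, missing, or unverified data elements. But that is at least one to two years away."

Noise
- Such noise may be caused by incorrect user input or by customers' slips when filling out questionnaires. If these errors do not occur too often, data mining tools can still ignore them and find the overall patterns in the data.

Causes of dirty data
- Misuse of abbreviations
- Data entry errors
- Different idioms (e.g., "ASAP" vs. "at first chance")
- Duplicate records
- Missing values
- Spelling variations
- Different units of measurement
- Outdated codes

Data cleansing (customer data)
Raw record: Maggie.klinefuture_, Margaret Smith-Kline phd, FUTURE Electronics, 5/23/03, 101 6th ave, manhattan, ny, 10012, 001124367
Standardized record:
- Salutation: Ms.
- First name: Margaret
- Last name: Smith-Kline
- Postname: Ph.D.
- Match standards: Maggie, Peg, Peggy
- Gender: Strong Female
- Company name: Future Electronics
- Address 1: 101 Avenue of th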
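
Sketch: outliers and missing values in code

The data-quality slides describe two concrete operations: taking logarithms to compress genuine outliers, and guessing (imputing) missing values. The Python sketch below illustrates them under stated assumptions and is not taken from the original deck. The function names `log_transform` and `impute`, the use of `None` to mark a missing value, and the choice of a normal distribution as the "statistical model" are all illustrative. The fourth technique on the slide, estimating from similar records, is left out because it depends on which mining method is chosen.

```python
import math
import random
import statistics


def log_transform(values):
    """Compress a wide range of positive values by taking natural logs.

    Suitable for outliers that are real observations (e.g. very high incomes).
    Assumes every value is strictly positive.
    """
    return [math.log(v) for v in values]


def impute(values, strategy="median", rng=None):
    """Return a copy of `values` with every None replaced by a guessed value.

    strategy (hypothetical names, ordered by increasing complexity):
      "random"  draw a value at random from the observed values
      "mode", "median", "mean"  use that statistic of the observed values
      "normal"  fit a normal distribution to the observed values and draw
                from it (assumed model; needs at least two observed values)
    """
    observed = [v for v in values if v is not None]
    if not observed:
        raise ValueError("no observed values to impute from")
    rng = rng or random.Random()

    constant_stats = {
        "mode": statistics.mode,
        "median": statistics.median,
        "mean": statistics.mean,
    }

    if strategy == "random":
        def fill():
            return rng.choice(observed)
    elif strategy in constant_stats:
        constant = constant_stats[strategy](observed)

        def fill():
            return constant
    elif strategy == "normal":
        mu = statistics.mean(observed)
        sigma = statistics.stdev(observed)

        def fill():
            return rng.gauss(mu, sigma)
    else:
        raise ValueError(f"unknown strategy: {strategy!r}")

    # Observed values are kept unchanged; only the gaps are filled.
    return [fill() if v is None else v for v in values]


if __name__ == "__main__":
    # Made-up incomes with two gaps and one genuine outlier.
    incomes = [30_000, 32_000, None, 35_000, 32_000, None, 1_000_000]
    for name in ("random", "mode", "median", "mean", "normal"):
        print(name, impute(incomes, name, rng=random.Random(0)))
    print(log_transform([v for v in incomes if v is not None]))
```

With data like the made-up incomes above, the mean is pulled far upward by the single outlier while the median and mode are not, which is one reason to deal with outliers before imputing.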