
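Today we walk through an enterprise-grade data mining project: a classification model for financial loans combined with time series analysis. The article is long, so you may want to bookmark it.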

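Project Background

Banks and other lending institutions routinely review a loan applicant's credit history, financial condition, and other factors to determine loan eligibility. The relationships among these factors are usually not explicitly defined and are instead judged heuristically. A company's recent performance, such as its recent rise or decline, is also commonly taken as an indicator of its financial stability. When these factors are weighed poorly or overlooked, the likelihood that a company will default on a loan can easily be misjudged. Effective classification combined with time series analysis can therefore yield a model that is both more accurate and considerably cheaper to apply. With this goal, we analyze the data, supplement it with data from other sets, and, through the steps taken to build the classification strategy and analyze the models, try to identify the static factors most strongly correlated with a company's financial health.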

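Target Variable and Predictors

We used a total of 42 features to determine the target in the final classifier. The target itself is a composite variable derived from the SVM classifier output and the ARIMA time series analysis. The most important predictors were:

Accounts Payable, Capital Expenditures, Additional Income Expense Items, Accounts Receivable and After Tax Return on Equity

The target variable is the sum of the percentage change in the company's stock price over two years and the likelihood of the company going bankrupt within the next 4 years.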

Problem Statement

Identify the static features that determine a company's growth trend and, in turn, its eligibility for a loan.

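Types of Models

The datasets comprise data on companies that filed for bankruptcy, six years of stock trends for companies listed on the New York Stock Exchange (NYSE), and financial data for those companies. On the bankrupt-companies dataset we tried several models, including decision trees, linear models, and logistic regression, and concluded from the AUC values that a support vector machine suited this dataset best. Stock trends were analyzed with the ARIMA time series method. The outputs of both were used to add a composite label to the financial dataset, and a random forest classifier was finally trained on that dataset to address the problem stated above. Because the data has a large number of features and high variance across companies, the random forest fit the data best and gave the best overall accuracy.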
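The AUC-based model comparison described above can be sketched roughly as follows. This is only an illustration on synthetic data, not the authors' code: the dataset from `make_classification`, the 5-fold cross-validation, and the names `candidates`, `auc_scores`, and `best_model` are all assumptions made for the example.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.svm import LinearSVC
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the bankruptcy data: two features, binary label.
X, y = make_classification(n_samples=300, n_features=2, n_informative=2,
                           n_redundant=0, random_state=0)

# Hypothetical candidate models to compare.
candidates = {
    "decision_tree": DecisionTreeClassifier(max_depth=3, random_state=0),
    "logistic_regression": LogisticRegression(),
    "linear_svm": LinearSVC(),
}

# Mean 5-fold cross-validated AUC for each candidate.
auc_scores = {
    name: cross_val_score(model, X, y, cv=5, scoring="roc_auc").mean()
    for name, model in candidates.items()
}

# Pick the model with the highest mean AUC.
best_model = max(auc_scores, key=auc_scores.get)
print(auc_scores, best_model)
```

Which model wins depends entirely on the data; on the real bankruptcy dataset the article reports that the SVM came out ahead.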

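Evaluation Method

The evaluation was carried out in four steps.

Data Cleaning

Because the data comes from multiple datasets from different sources, data cleaning was a major operation needed to ensure that data is represented consistently across these datasets.

Bankruptcy Prediction

The dataset of companies that filed for bankruptcy contains no financial data for non-bankrupt companies, so it could not be used directly to train a model. To work around this, we took earlier-year data from the larger financial dataset of NYSE-listed companies and added it to the bankruptcy dataset so that the resulting dataset is more general. We then trained an SVM on this dataset and verified that its AUC score was about 0.75.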

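Time Series Analysis

The stock price data contains daily closing prices for about 500 NYSE-listed companies. The data was downsampled to weekly average prices. Because the time series behaves differently from company to company, each company has to be modeled separately. As expected, the data in some cases showed strong trend and seasonality, which had to be removed by differencing before fitting the ARIMA model. Based on ACF and PACF plots and manual trials, p and q were set to 2 and 1 respectively. The mean absolute error during modeling was about 0.05, which suggests the time series analysis is fairly accurate.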
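As a small illustration of downsampling daily closes to weekly averages, here is a sketch using pandas `resample`. It is an alternative formulation under stated assumptions, not the method used in the code later in the article (which takes a 5-row rolling mean and keeps every fifth row); the dates and prices are made up.

```python
import numpy as np
import pandas as pd

# Hypothetical daily closes over 20 business days (four full weeks,
# starting on Monday 2016-01-04).
dates = pd.bdate_range("2016-01-04", periods=20)
daily = pd.Series(np.arange(1.0, 21.0), index=dates, name="close")

# Average the closes within each calendar week; each bin ends on Friday.
weekly = daily.resample("W-FRI").mean()
print(weekly)
```

Resampling by calendar week keeps the bins aligned with real weeks even when holidays remove trading days, which a fixed 5-row window does not.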

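Augmenting the Initial Dataset with Predicted Data

The dataset of NYSE-listed companies' financial information was augmented with a composite label combining the predicted bankruptcy value and the percentage change in each company's stock price over two years. This label is continuous; we rounded it to one decimal place and multiplied by 10 to obtain an integer. The label was used to train a random forest classifier in order to determine which features the model considers most important for predicting a company's growth trend. These features were then analyzed to find their correlation with the label. The random forest classifier itself was evaluated by examining its confusion matrix. Because the label is multi-class rather than binary, a single ROC curve cannot be plotted to evaluate the model; the Matthews correlation coefficient, however, serves well as a summary measure.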
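The conversion of the continuous composite label into an integer class can be sketched as below. This is a minimal sketch assuming the label is first min-max scaled to the range -1 to 1 (as the code section later does); the function name `discretize_label` and the sample values are hypothetical.

```python
import numpy as np

def discretize_label(composite, low=-1.0, high=1.0):
    """Min-max scale to [low, high], round to one decimal, multiply by 10."""
    composite = np.asarray(composite, dtype=float)
    span = composite.max() - composite.min()
    scaled = low + (composite - composite.min()) * (high - low) / span
    # np.rint guards against float artifacts such as 0.3 * 10 = 3.0000000000000004
    return np.rint(np.round(scaled, 1) * 10).astype(int)

# Hypothetical composite values: stability (0 or 1) plus fractional price change.
composite = np.array([0.0, 0.5, 1.0, 1.5, 2.0])
labels = discretize_label(composite)
print(labels)  # [-10  -5   0   5  10]
```

Casting to integers matters in practice: a classifier treats `3.0000000000000004` and `3` as different values, or may reject non-integer floats as class labels altogether.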

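Assumptions / Limitations

1. Bankruptcy prediction is an important topic in machine learning research. Several papers address it, some using neural networks and advanced machine learning techniques to predict the likelihood of bankruptcy more accurately and reliably. We found that certain attributes are commonly used across these models and therefore assumed that these attributes are highly correlated with the probability of bankruptcy. Another important reason for this decision is that no public dataset with more financial data on bankrupt companies was available for this project. We therefore trained the predictor on only these most relevant features, which makes it an extremely simple bankruptcy prediction model.
2. The ACF and PACF plots turned out to be ambiguous and insufficient for determining the AR and MA parameters. We therefore tried several values and assumed that the (2, 1) combination predicts the data best.
3. The augmentation step merges the binary predicted bankruptcy value with the continuous averaged time series prediction. We assume that this is a good indicator of whether a company is rising or declining and, consequently, of whether lending to it is safe.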

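Scope of the Problem

Having applied several classification strategies and modeled the time series, this analysis can be taken further by increasing the number of features and data points to achieve better bankruptcy prediction, and by tuning the AR and MA parameters of the time series models. In the current analysis we found multiple features correlated with the target variable, which can complement an organization's traditional heuristic knowledge. Banks and other lending institutions can use this analysis to pay closer attention to these features, something that descriptive or inferential analysis alone might not achieve.

Changes from the original proposal and the reasons for them: our initial proposal was to use only the bankruptcy predictor to label the financial dataset. On closer analysis, however, we found that it is not a sufficiently comprehensive indicator of a company's overall standing. We therefore decided to augment it with a time series analysis of the company's stock trend, which is a better indicator of an organization's growth or decline.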

Code

Importing the Modules

import re
import warnings

import numpy as np
import pandas as pd
from IPython.display import display
from matplotlib import pyplot as plt
from scipy import stats
from sklearn import metrics
from sklearn import preprocessing as pp
from sklearn import svm
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from statsmodels.graphics.api import qqplot
from statsmodels.graphics.tsaplots import plot_acf, plot_pacf
from statsmodels.stats.stattools import durbin_watson
# arima_model.ARMA is only usable in statsmodels < 0.13; newer releases
# provide statsmodels.tsa.arima.model.ARIMA instead.
from statsmodels.tsa import arima_model

warnings.filterwarnings('ignore')

Data Preprocessing

Several helper functions are defined here, covering column-name cleaning, date parsing, and a stationarity check.

def cleanColumnName(column):
    # Replace symbols in the column name with a space
    column = re.sub(r'\W+', ' ', column.strip())
    # Remove any spaces left at the ends of the column name
    column = column.strip()
    # Replace the spaces between words with '_'
    return column.lower().replace(" ", "_")

def dateParse(dates):
    # pd.datetime was removed in pandas 1.0; pd.to_datetime is the equivalent
    return pd.to_datetime(dates, format='%Y-%m-%d')

def test_stationarity(ticker, timeseries):
    # Compute rolling statistics
    rolmean = timeseries.rolling(window=7, center=False).mean()
    rolstd = timeseries.rolling(window=7, center=False).std()

    # Plot the rolling statistics
    orig = plt.plot(timeseries, color='blue', label='Original')
    mean = plt.plot(rolmean, color='red', label='Rolling Mean')
    std = plt.plot(rolstd, color='black', label='Rolling Std')
    plt.legend(loc='best')
    plt.title(ticker)
    plt.show(block=False)

    # Durbin-Watson statistic for autocorrelation
    dftest = durbin_watson(timeseries)
    print(ticker)
    print("Durbin-Watson statistic for " + ticker + ":", dftest)
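To make the column-name cleaning concrete, here is a small standalone sketch of the same idea applied to a few column names. The function name `clean_column_name` and the sample names are chosen for illustration only.

```python
import re

def clean_column_name(column):
    # Replace each run of non-word characters with a single space
    column = re.sub(r"\W+", " ", column.strip())
    # Trim spaces left at either end, lower-case, join words with underscores
    return column.strip().lower().replace(" ", "_")

cleaned = [clean_column_name(c) for c in
           ["Total Assets", " Period Ending ", "Pre-Tax ROE", "Misc. Stocks"]]
print(cleaned)
# ['total_assets', 'period_ending', 'pre_tax_roe', 'misc_stocks']
```

Replacing symbols with a space rather than deleting them is what keeps words separated, so that names such as `total_assets` and `ticker_symbol` come out with underscores.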

Data Cleaning

Remove unnecessary columns from the datasets. This is purely heuristic: we dropped them based entirely on our own understanding of what the columns mean.

# Read the data
bankrupt_companies = pd.read_csv("public_company_bankruptcy_cases.csv")
# The "date" column is parsed on load; it is parsed again explicitly later on
companies_stock_prices = pd.read_csv("prices-split-adjusted.csv",
                                     parse_dates=["date"],
                                     usecols=["date", "symbol", "close"])
nyse_data = pd.read_csv("fundamentals.csv", index_col='Unnamed: 0')

bankrupt_companies.drop(["DISTRICT", "STATE", "COMPANY NAME"],
                        axis=1, inplace=True)
nyse_data.drop(["Deferred Asset Charges", "Deferred Liability Charges",
                "Depreciation", "Earnings Before Tax", "Effect of Exchange Rate",
                "Equity Earnings/Loss Unconsolidated Subsidiary", "Goodwill",
                "Income Tax", "Intangible Assets", "Interest Expense", "Liabilities",
                "Minority Interest", "Misc. Stocks", "Net Cash Flow-Operating",
                "Net Cash Flows-Financing", "Net Cash Flows-Investing",
                "Net Income Adjustments", "Net Income Applicable to Common Shareholders",
                "Net Income-Cont. Operations", "Operating Income", "Operating Margin",
                "Other Assets", "Other Current Assets", "Other Current Liabilities",
                "Other Financing Activities", "Other Investing Activities",
                "Other Liabilities", "Other Operating Activities", "Other Operating Items",
                "Pre-Tax Margin", "Pre-Tax ROE", "Research and Development",
                "Total Current Assets", "Total Current Liabilities",
                "Total Liabilities & Equity", "Treasury Stock", "For Year"],
               axis=1, inplace=True)

# Clean the data so that column names follow a consistent format
bankrupt_companies.columns = [cleanColumnName(c) for c in bankrupt_companies.columns]
bankrupt_companies.columns = ["total_assets", "total_liabilities"]
companies_stock_prices.columns = [cleanColumnName(c) for c in companies_stock_prices.columns]
nyse_data.columns = [cleanColumnName(c) for c in nyse_data.columns]

nyse_data.head()

Handling Missing Values

Remove NaN values from each dataset.

bankrupt_companies.dropna(axis=0, subset=['total_assets', 'total_liabilities'],
                          inplace=True)

nyse_data.dropna(axis=1, how='any', inplace=True)
nyse_data.dropna(axis=0, how='any', inplace=True)

companies_stock_prices.dropna(axis=0, how='any', inplace=True)

Training an SVM as the Bankruptcy Predictor

Create a new dataframe containing the 2013 data for non-bankrupt companies, then randomly sample it to obtain as many non-bankrupt companies as there are bankrupt companies in the other dataset.

nyse_2013 = nyse_data.loc[nyse_data['period_ending'].str.contains("2013"),
                          ["total_assets", "total_liabilities"]]

nyse_2013 = nyse_2013.sample(
    n=bankrupt_companies.shape[0],
    replace=False)

Re-index the sample so that its index continues where the bankruptcy dataset's index ends.

start = bankrupt_companies.index[-1] + 1
nyse_2013.index = range(start, start + nyse_2013.shape[0])

Manually add the "stability" column used as the label (0 for bankrupt, 1 for non-bankrupt).

bankrupt_companies["stability"] = 0
nyse_2013["stability"] = 1

Merge the bankrupt and non-bankrupt data into a single dataset that can be used to train the classifier.

merged_bankruptcy_dataset = pd.concat(
    [bankrupt_companies, nyse_2013])

Scale the data so that assets and liabilities fall in the same range.

scaler = pp.MinMaxScaler()
scaler.fit(merged_bankruptcy_dataset[["total_assets", "total_liabilities"]])
merged_bankruptcy_dataset[["total_assets", "total_liabilities"]] = scaler.transform(
    merged_bankruptcy_dataset[["total_assets", "total_liabilities"]])
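For readers unfamiliar with min-max scaling, the transformation maps each column to the range 0 to 1 via (x - min) / (max - min). A minimal NumPy sketch with made-up numbers follows; the function name `min_max_scale` is an assumption for this example, and it mirrors only the default behavior of `MinMaxScaler`.

```python
import numpy as np

def min_max_scale(values):
    """Scale each column to [0, 1] using its own minimum and maximum."""
    values = np.asarray(values, dtype=float)
    lo = values.min(axis=0)
    hi = values.max(axis=0)
    return (values - lo) / (hi - lo)

# Hypothetical (total_assets, total_liabilities) rows on very different scales.
data = np.array([[100.0, 20.0],
                 [300.0, 60.0],
                 [500.0, 100.0]])
scaled = min_max_scale(data)
print(scaled)
```

After scaling, both columns span the same range, so neither dominates the SVM's distance computations merely because of its units.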

Randomly split the merged dataset into a training set and a test set for the classifier.

(train_bankruptcy_data, test_bankruptcy_data,
 train_bankruptcy_target, test_bankruptcy_target) = train_test_split(
    merged_bankruptcy_dataset.iloc[:, 0:-1],
    merged_bankruptcy_dataset.iloc[:, -1],
    test_size=0.25)

Train the support vector machine on the training data.

Svm_model = svm.LinearSVC()
Svm_model.fit(train_bankruptcy_data, train_bankruptcy_target)

print(train_bankruptcy_data.shape,
      Svm_model.score(train_bankruptcy_data, train_bankruptcy_target))
print(test_bankruptcy_data.shape,
      Svm_model.score(test_bankruptcy_data, test_bankruptcy_target))

((190, 2), 0.83157894736842108) ((64, 2), 0.8125)

Compute and plot the ROC curve and the area under it (AUC) to gauge the classifier's accuracy.

FPR, TPR, _ = metrics.roc_curve(test_bankruptcy_target,
                                Svm_model.predict(test_bankruptcy_data))
auc = metrics.auc(FPR, TPR)

plt.plot(FPR, TPR, 'b', label='AUC for SVM = %0.2f' % auc)
plt.title("AUC For SVM Model")
plt.legend(loc='best')
plt.plot([0, 1], [0, 1], 'r--')
plt.xlabel('FPR')
plt.ylabel('TPR')
plt.show()
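One point worth noting about the ROC computation: passing hard 0/1 predictions to `roc_curve` yields a curve with a single interior point, whereas passing continuous scores traces the full curve. The sketch below shows the difference on synthetic data using `LinearSVC.decision_function`; it is offered as an optional refinement under assumed data, not as the article's method.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import roc_auc_score, roc_curve
from sklearn.model_selection import train_test_split
from sklearn.svm import LinearSVC

# Synthetic stand-in for the scaled assets/liabilities data.
X, y = make_classification(n_samples=400, n_features=2, n_informative=2,
                           n_redundant=0, random_state=1)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=1)

clf = LinearSVC().fit(X_train, y_train)

# Hard 0/1 predictions: at most one interior point on the ROC curve.
fpr_hard, tpr_hard, _ = roc_curve(y_test, clf.predict(X_test))

# Signed distances to the separating hyperplane: one point per threshold.
scores = clf.decision_function(X_test)
fpr_soft, tpr_soft, _ = roc_curve(y_test, scores)
auc_soft = roc_auc_score(y_test, scores)
print(len(fpr_hard), len(fpr_soft), auc_soft)
```

The score-based AUC measures how well the model ranks companies, which is usually what an AUC comparison between models is meant to capture.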

Keep only the most recent year of data for every company. We consider only the latest data and add the predicted bankruptcy value to the original dataset.

nyse_data.drop_duplicates(subset='ticker_symbol', keep='last', inplace=True)

nyse_data["stability"] = Svm_model.predict(
    scaler.transform(nyse_data[["total_assets", "total_liabilities"]]))
print("Companies predicted to go bankrupt over a 4 year period:",
      len(nyse_data.loc[nyse_data["stability"] != 1, "ticker_symbol"]))

Companies predicted to go bankrupt over a 4 year period: 114

Time Series Analysis

companies_stock_prices["date"] = pd.to_datetime(companies_stock_prices["date"],
                                                format="%Y-%m-%d")
companies_stock_prices.dropna(axis=0,
                              how='any',
                              inplace=True)

# Sort by ticker symbol, then by date
companies_stock_prices.sort_values(by=["symbol", "date"], inplace=True)

Since each company's stock trend is assumed to be different, a separate time series has to be modeled for each company. To do this, each company's stock prices are stored under a separate key in a dictionary, so that time series analysis can be run on each company individually.

weekly_stock_prices = {}

for i in np.unique(companies_stock_prices["symbol"].values):
    weekly_stock_prices[i] = companies_stock_prices.loc[
        companies_stock_prices["symbol"] == i, :].copy()
    weekly_stock_prices[i] = weekly_stock_prices[i].reset_index(drop=True)

Convert the daily stock data to weekly data by keeping one row per week. There are about 450 companies, so only the first 10 plots are shown, since drawing all of them takes a long time. Trend and seasonality can be assumed to be present in all of them.

count = 0
for i in weekly_stock_prices:
    weekly_mean = weekly_stock_prices[i]["close"].rolling(window=5, center=False).mean()[4:]
    # Convert daily stock data to weekly by keeping one row out of every five
    weekly_stock_prices[i] = weekly_stock_prices[i].loc[
        weekly_stock_prices[i].index % 5 == 0, :].copy()
    weekly_stock_prices[i]["close"] = weekly_mean
    weekly_stock_prices[i].index = weekly_stock_prices[i]["date"]
    weekly_stock_prices[i].drop(["symbol", "date"], axis=1, inplace=True)
    weekly_stock_prices[i].dropna(axis=0, how='any', inplace=True)

    count += 1
    if count <= 10:
        test_stationarity(i, weekly_stock_prices[i])

AGN ('Durbin-Watson statistic for AGN: ', array([ 0.00106633]))

EOG ('Durbin-Watson statistic for EOG: ', array([ 0.00104565]))

CPB ('Durbin-Watson statistic for CPB: ', array([ 0.00042048]))

EVHC ('Durbin-Watson statistic for EVHC: ', array([ 0.00806171]))

IDXX ('Durbin-Watson statistic for IDXX: ', array([ 0.00094586]))

QRVO ('Durbin-Watson statistic for QRVO: ', array([ 0.00290384]))

JWN ('Durbin-Watson statistic for JWN: ', array([ 0.00088175]))

JBHT ('Durbin-Watson statistic for JBHT: ', array([ 0.00059562]))

TAP ('Durbin-Watson statistic for TAP: ', array([ 0.00062282]))

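VRTX ('Durbin-Watson statistic for VRTX: ', array([ 0.00270465]))

As expected, the stock price data shows a clearly visible trend, and in many cases closer inspection also reveals seasonality. The low Durbin-Watson values are evidence of strong positive autocorrelation, which is understandable because stock prices depend on their previous values. To run an ARIMA analysis on this data, it must first be transformed into a stationary series.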
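To see why a trending price series gives a Durbin-Watson value near 0 while a differenced series gives a value near 2, consider this sketch on a simulated random walk. The statistic is the sum of squared successive differences divided by the sum of squares; the function name `durbin_watson_stat` and the simulated data are assumptions made for the example.

```python
import numpy as np

def durbin_watson_stat(series):
    """Sum of squared successive differences over the sum of squares."""
    series = np.asarray(series, dtype=float)
    return np.sum(np.diff(series) ** 2) / np.sum(series ** 2)

rng = np.random.default_rng(0)
noise = rng.normal(size=2000)
random_walk = 100.0 + np.cumsum(noise)   # price-like series with strong memory

dw_walk = durbin_watson_stat(random_walk)            # close to 0
dw_diff = durbin_watson_stat(np.diff(random_walk))   # close to 2
print(dw_walk, dw_diff)
```

A value near 0 signals strong positive autocorrelation, a value near 2 signals little or none, and values approaching 4 signal negative autocorrelation.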

# Make the data stationary
count = 0
weekly_stock_prices_log = {}
for i in weekly_stock_prices:
    # Difference the data to remove trend and seasonality
    weekly_stock_prices_log[i] = weekly_stock_prices[i].copy()
    weekly_stock_prices_log[i]["first_difference"] = (
        weekly_stock_prices_log[i]["close"]
        - weekly_stock_prices_log[i]["close"].shift(1))
    weekly_stock_prices_log[i]["seasonal_first_difference"] = (
        weekly_stock_prices_log[i]["first_difference"]
        - weekly_stock_prices_log[i]["first_difference"].shift(12))

    count += 1
    if count <= 10:
        test_stationarity(
            i, weekly_stock_prices_log[i]["seasonal_first_difference"].dropna(inplace=False))

AGN ('Durbin-Watson statistic for AGN: ', 1.8408166958817405)

EOG ('Durbin-Watson statistic for EOG: ', 1.6299518594407623)

CPB ('Durbin-Watson statistic for CPB: ', 1.5454599084578173)

EVHC ('Durbin-Watson statistic for EVHC: ', 1.4213426917002945)

IDXX ('Durbin-Watson statistic for IDXX: ', 1.7448077126902013)

QRVO ('Durbin-Watson statistic for QRVO: ', 1.3805906045088099)

JWN ('Durbin-Watson statistic for JWN: ', 1.6385737145457053)

JBHT ('Durbin-Watson statistic for JBHT: ', 1.6966894515415203)

TAP ('Durbin-Watson statistic for TAP: ', 1.8412354264794373)

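VRTX ('Durbin-Watson statistic for VRTX: ', 1.6067817382582221)

The results show that the series has lost the trend and seasonality it had before. The Durbin-Watson statistic is now close to 2, which indicates little remaining autocorrelation, so we treat the differenced series as stationary and proceed with the analysis.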

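# Plot the ACF and PACF to determine the autoregressive and moving average parameters
count = 0
for i in weekly_stock_prices_log:
    fig = plt.figure(figsize=(12, 5))
    ax1 = fig.add_subplot(121)
    plot_acf(weekly_stock_prices_log[i]["seasonal_first_difference"].iloc[13:],
             lags=50, title="Autocorrelation for " + i, ax=ax1)
    ax2 = fig.add_subplot(122)
    plot_pacf(weekly_stock_prices_log[i]["seasonal_first_difference"].iloc[13:],
              lags=50, title="Partial Autocorrelation for " + i, ax=ax2)
    count += 1
    if count == 5:
        break
plt.show()

The ACF and PACF plots show a spike at lag 1. The plots are not conclusive on their own, however, since neither can be said to decay exponentially, and both also show some outliers. Trials with different values of p and q gave markedly better results at (2, 1). Running an ACF-PACF analysis for every company is not feasible, so for cases where the SVD did not converge at (2, 1), a fallback of (1, 0) was used.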

count = 0
stock_predictions = {}
for i in weekly_stock_prices_log:
    # Hold out the last 20 points for the accuracy check; train on the rest
    split_point = len(weekly_stock_prices_log[i]) - 20
    # Predict over the existing points plus 117 more weeks, the span from
    # the dataset's last date to 2018-12-31
    num_of_predictions = len(weekly_stock_prices_log[i]) + 117
    training = weekly_stock_prices_log[i][0:split_point]
    model = {}
    # Try p=2, q=1 first; if fitting fails, fall back to p=1, q=0.
    # Note: arima_model.ARMA requires statsmodels < 0.13.
    try:
        model = arima_model.ARMA(training["close"], order=(2, 1)).fit()
    except Exception:
        model = arima_model.ARMA(training["close"], order=(1, 0)).fit()

    # Store the predictions in a dataframe to simplify later steps
    daterange = pd.date_range(training.index[0], periods=num_of_predictions,
                              freq='W-MON').tolist()
    stock_predictions[i] = pd.DataFrame(columns=["date", "prediction"])
    stock_predictions[i]["date"] = daterange
    stock_predictions[i]["prediction"] = model.predict(start=0, end=num_of_predictions)
    stock_predictions[i].set_index("date", inplace=True)
    # Draw a Q-Q plot to check the distribution of the residuals
    # (stats is scipy.stats)
    if count < 5:
        resid = model.resid
        print("For " + i + ":", stats.normaltest(resid))
        qqplot(resid, line='q', fit=True)
        plt.show()
    count += 1

('For AGN: ', NormaltestResult(statistic=472.93123930305205, pvalue=2.0150518495630914e-103))

('For EOG: ', NormaltestResult(statistic=120.49648362661878, pvalue=6.8315780758386102e-27))

('For CPB: ', NormaltestResult(statistic=339.86796767404019, pvalue=1.579823361925116e-74))

('For EVHC: ', NormaltestResult(statistic=69.17501926644907, pvalue=9.5243516902465695e-16))

('For IDXX: ', NormaltestResult(statistic=360.2101109972532, pvalue=6.0446092870173276e-79))

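The plots above show a reasonably even spread of the residuals, so we consider the residual analysis adequate. Note that the normality tests printed above have p-values far below 0.05, so this judgment rests on the Q-Q plots rather than on the test.

Time series model analysis.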

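count = 0
for i in weekly_stock_prices_log:
    # Plot actual values against predicted values
    weekly_stock_prices_log[i]["close"].plot()
    stock_predictions[i]["prediction"].plot()
    plt.show()

    # Compute the error measures on the held-out validation points
    split_point = len(weekly_stock_prices_log[i]) - 20
    forecastedValues = stock_predictions[i]["prediction"].iloc[
        split_point:len(weekly_stock_prices_log[i])]
    actualValues = weekly_stock_prices_log[i]["close"].iloc[split_point:]
    # Mean forecast error: the signed average of actual minus forecast
    mfe = actualValues.subtract(forecastedValues).mean()
    # Reported as "mean absolute error": |MFE| relative to the forecast,
    # averaged over the validation points (a relative measure, not the
    # standard mean of absolute errors)
    mae = (abs(mfe) / forecastedValues).mean()
    display("Mean Absolute Error for " + i + ": " + str(mae))
    display("Mean Forecast Error for " + i + ": " + str(mfe))
    print("-----" * 50)
    count += 1
    if count > 10:
        break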

'Mean Absolute Error for AGN: 0.00193187347291' 'Mean Forecast Error for AGN: 0.481341889454' -------------------------------------------------

'Mean Absolute Error for EOG: 0.0798100720231' 'Mean Forecast Error for EOG: 6.73902186871' -------------------------------------------------

'Mean Absolute Error for CPB: 0.00893546704868' 'Mean Forecast Error for CPB: 0.54092487694' -------------------------------------------------

'Mean Absolute Error for EVHC: 0.143090575838' 'Mean Forecast Error for EVHC: -3.51053172619' -------------------------------------------------

'Mean Absolute Error for IDXX: 0.0264690184111' 'Mean Forecast Error for IDXX: 2.85600695121' -------------------------------------------------

'Mean Absolute Error for QRVO: 0.100785079934' 'Mean Forecast Error for QRVO: -5.95620693487' -------------------------------------------------

'Mean Absolute Error for JWN: 0.158397127455' 'Mean Forecast Error for JWN: 6.89272442754' -------------------------------------------------

'Mean Absolute Error for JBHT: 0.0206382415512' 'Mean Forecast Error for JBHT: 1.66385893237' -------------------------------------------------

'Mean Absolute Error for TAP: 0.0115749676383' 'Mean Forecast Error for TAP: 1.14684278635' -------------------------------------------------

'Mean Absolute Error for VRTX: 0.0246045992625' 'Mean Forecast Error for VRTX: 2.4170609498' -------------------------------------------------

'Mean Absolute Error for BWA: 0.0334229670671' 'Mean Forecast Error for BWA: 1.09813003018' -------------------------------------------------

Mean absolute error values close to 0 indicate that the time series models predict well.
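For reference, the textbook definitions of mean forecast error (MFE), mean absolute error (MAE), and mean absolute percentage error (MAPE) can be sketched as below. These are the standard formulas, which differ from the relative measure computed in the loop above; the function name `forecast_errors` and the numbers are illustrative assumptions.

```python
import numpy as np

def forecast_errors(actual, forecast):
    """Return (MFE, MAE, MAPE) for paired actual and forecast values."""
    actual = np.asarray(actual, dtype=float)
    forecast = np.asarray(forecast, dtype=float)
    errors = actual - forecast
    mfe = errors.mean()                     # signed: reveals systematic bias
    mae = np.abs(errors).mean()             # unsigned: average size of a miss
    mape = np.abs(errors / actual).mean()   # scale-free relative error
    return mfe, mae, mape

# Hypothetical weekly closes and their forecasts.
actual = [100.0, 102.0, 98.0, 101.0]
forecast = [101.0, 100.0, 99.0, 101.0]
mfe, mae, mape = forecast_errors(actual, forecast)
print(mfe, mae, mape)
```

In this example the signed errors cancel, so the MFE is 0 while the MAE is 1.0, which is why the two measures are best reported together.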

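Augmenting the Initial Dataset with Predicted Data

Create a new column to store the predicted stock movement. A percentage measure is computed to estimate the rise or fall of each company's stock, and hence the organization's growth, so that all organizations are compared on a fair scale.

nyse_data["stock_pred"] = np.nan
for i in stock_predictions:
    # Last 105 weekly predictions, roughly two years
    recent = stock_predictions[i]["prediction"].tail(105)
    perc = (recent.mean() - recent.iloc[0]) / recent.iloc[0]
    nyse_data.loc[nyse_data["ticker_symbol"] == i, "stock_pred"] = perc

Add the predicted bankruptcy value to the predicted stock change to produce a composite label that represents a company's growth or decline. Drop unnecessary and non-numeric columns from the dataset to simplify modeling.

nyse_data["stock_pred"] += nyse_data["stability"]

nyse_data.drop(["period_ending", "stability", "ticker_symbol"], axis=1, inplace=True)
nyse_data.dropna(axis=0, subset=["stock_pred"], inplace=True)

Scale the features of the dataset.

nyse_data_scaled = nyse_data.iloc[:, 0:-1].copy()
scaler = pp.StandardScaler()
nyse_data_scaled[nyse_data_scaled.columns] = scaler.fit_transform(
    nyse_data_scaled[nyse_data_scaled.columns])

Scale the target variable to the range -1 to 1, round it to the nearest tenth, and multiply by 10 to produce a discrete multi-valued label.

scaler = pp.MinMaxScaler(feature_range=(-1, 1))
# A Series has no reshape method, so go through the underlying array;
# np.rint and astype(int) turn the result into clean integer class labels
nyse_data_target_scaled = np.rint(
    scaler.fit_transform(nyse_data.iloc[:, -1].values.reshape(-1, 1)).round(decimals=1) * 10
).astype(int).ravel()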

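Split the augmented dataset into a training set and a test set for the classifier.

train_data, test_data, train_target, test_target = train_test_split(
    nyse_data_scaled, nyse_data_target_scaled, test_size=0.25)

Train the random forest classifier.

RF = RandomForestClassifier()
RF.fit(train_data, train_target)

model_predictions = RF.predict(test_data)

print("Training:-->", train_data.shape, RF.score(train_data, train_target))
print("Testing:-->", test_data.shape, RF.score(test_data, test_target))

('Training:-->', (334, 34), 0.98502994011976053) ('Testing:-->', (112, 34), 0.7410714285714286)

Analyze the features that the random forest model found to be most strongly related to the augmented label, and inspect their numerical correlations.

# argsort is ascending, so the last five indices are the five most important
# features (slicing before argsort would only rank the last five columns)
top_features = np.argsort(RF.feature_importances_)[-5:]
# Append the label column so it appears in the correlation table
top_features = np.append(top_features, -1)
display(nyse_data.iloc[:, top_features].corr())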
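Selecting the most important features from a fitted random forest is easy to get subtly wrong, so here is a small sketch on synthetic data. The dataset and the variable names `top_five` and `wrong` are assumptions for the illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic data: 10 features, of which the first 3 are informative.
X, y = make_classification(n_samples=300, n_features=10, n_informative=3,
                           n_redundant=0, shuffle=False, random_state=0)
forest = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)

importances = forest.feature_importances_

# Correct: sort all importances, then take the five largest (best first).
top_five = np.argsort(importances)[-5:][::-1]

# Pitfall: slicing first only ranks the last five columns among themselves,
# so the result is always some ordering of 0..4.
wrong = np.argsort(importances[-5:])
print(top_five, wrong)
```

The order of operations matters: `argsort` must see the full importance vector before the slice is taken.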

Generate a confusion matrix and compute the Matthews correlation coefficient as evaluation metrics for the trained random forest classifier.

display("CONFUSION MATRIX: ", metrics.confusion_matrix(test_target, model_predictions))
display("MATTHEWS CORRELATION CO-EFFICIENT", metrics.matthews_corrcoef(test_target, model_predictions))

'CONFUSION MATRIX: '
array([[ 1,  4,  0,  0,  0,  0,  0,  0,  0],
       [ 2, 17,  0,  0,  0,  0,  0,  0,  0],
       [ 1,  1,  0,  0,  0,  0,  0,  0,  0],
       [ 0,  1,  0,  0,  0,  0,  0,  0,  0],
       [ 0,  0,  0,  0,  0,  4,  0,  0,  0],
       [ 0,  0,  0,  0,  0, 64,  3,  0,  0],
       [ 0,  1,  0,  0,  0,  7,  1,  0,  0],
       [ 0,  0,  0,  0,  0,  3,  1,  0,  0],
       [ 0,  0,  0,  0,  0,  0,  1,  0,  0]])
'MATTHEWS CORRELATION CO-EFFICIENT'
0.53348354519442676
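To make the two evaluation measures concrete, here is a toy multi-class example with hand-checkable values. The label arrays are made up for illustration; `confusion_matrix` and `matthews_corrcoef` are the same scikit-learn functions used above.

```python
import numpy as np
from sklearn.metrics import confusion_matrix, matthews_corrcoef

# Hypothetical true and predicted classes for eight samples.
y_true = np.array([0, 0, 1, 1, 2, 2, 2, 2])
y_pred = np.array([0, 1, 1, 1, 2, 2, 2, 0])

# Rows are true classes, columns are predicted classes.
cm = confusion_matrix(y_true, y_pred)
print(cm)
# [[1 1 0]
#  [0 2 0]
#  [1 0 3]]

mcc = matthews_corrcoef(y_true, y_pred)      # between -1 and 1
perfect = matthews_corrcoef(y_true, y_true)  # 1.0 for perfect agreement
print(mcc, perfect)
```

A coefficient of 1 means perfect prediction, 0 means no better than chance, and negative values mean systematic disagreement, which makes the measure usable for imbalanced multi-class labels like the one in this project.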

Reviewed and edited by: 李倩


Original title: Data Mining in Practice: Financial Loan Classification Model and Time Series Analysis

Source: WeChat official account "Data Analysis and Development" (WeChat ID: DBDevs).
