Practical 1

Aim: Creating a Data Model using Cassandra.

Go to the Cassandra directory:
C:\apache-cassandra-3.11.4\bin

Run the Cassandra.bat file.
Open C:\apache-cassandra-3.11.4\bin\cqlsh.py with Python 2.7 and run it.

Creating a Keyspace using cqlsh:
Create keyspace keyspace1 with replication = {'class':'SimpleStrategy', 'replication_factor': 3};
Use keyspace1;

Create table dept ( dept_id int PRIMARY KEY, dept_name text, dept_loc text );

Create table emp ( emp_id int PRIMARY KEY, emp_name text, dept_id int, email text, phone text );

Insert into dept (dept_id, dept_name, dept_loc) values (1001, 'Accounts', 'Mumbai');
Insert into dept (dept_id, dept_name, dept_loc) values (1002, 'Marketing', 'Delhi');
Insert into dept (dept_id, dept_name, dept_loc) values (1003, 'HR', 'Chennai');

Insert into emp ( emp_id, emp_name, dept_id, email, phone ) values (1001, 'ABCD', 1001, 'abcd@company.com', '1122334455');
Insert into emp ( emp_id, emp_name, dept_id, email, phone ) values (1002, 'DEFG', 1001, 'defg@company.com', '2233445566');
Insert into emp ( emp_id, emp_name, dept_id, email, phone ) values (1003, 'GHIJ', 1002, 'ghij@company.com', '3344556677');
Insert into emp ( emp_id, emp_name, dept_id, email, phone ) values (1004, 'JKLM', 1002, 'jklm@company.com', '4455667788');
Insert into emp ( emp_id, emp_name, dept_id, email, phone ) values (1005, 'MNOP', 1003, 'mnop@company.com', '5566778899');
Insert into emp ( emp_id, emp_name, dept_id, email, phone ) values (1006, 'MNOP', 1003, 'mnop@company.com', '5566778844');

update dept set dept_name='Human Resource' where dept_id=1003;
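
To verify the inserts and the update from outside cqlsh, a minimal Python sketch using the DataStax driver could look like this (an assumption, not part of the practical: it requires pip install cassandra-driver and Cassandra listening on localhost):

# Minimal verification sketch (assumes: pip install cassandra-driver,
# Cassandra running on localhost, keyspace1 created as above)
from cassandra.cluster import Cluster

cluster = Cluster(['127.0.0.1'])
session = cluster.connect('keyspace1')

# Confirm the HR department was renamed by the UPDATE above
for row in session.execute('SELECT dept_id, dept_name, dept_loc FROM dept'):
    print(row.dept_id, row.dept_name, row.dept_loc)

# List all employees with their department ids
for row in session.execute('SELECT emp_id, emp_name, dept_id FROM emp'):
    print(row.emp_id, row.emp_name, row.dept_id)

cluster.shutdown()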
Practical 2

Aim: Write a Python / R program to convert from the following formats to HORUS format.
The source data files are located in the C:\VKHCG\05-DS\9999-Data directory.

A. Text delimited CSV to HORUS format

Code:
# Utility Start CSV to HORUS =================================
# Standard Tools
#=============================================================
import pandas as pd
# Input Agreement ============================================
sInputFileName='C:/VKHCG/05-DS/9999-Data/Country_Code.csv'
InputData=pd.read_csv(sInputFileName, encoding="latin-1")
print('Input Data Values ===================================')
print(InputData)
print('=====================================================')
# Processing Rules ===========================================
ProcessData=InputData
# Remove columns ISO-2-CODE and ISO-3-Code
ProcessData.drop('ISO-2-CODE', axis=1, inplace=True)
ProcessData.drop('ISO-3-Code', axis=1, inplace=True)
# Rename Country and ISO-M49
ProcessData.rename(columns={'Country': 'CountryName'}, inplace=True)
ProcessData.rename(columns={'ISO-M49': 'CountryNumber'}, inplace=True)
# Set new index
ProcessData.set_index('CountryNumber', inplace=True)
# Sort data by CountryName
ProcessData.sort_values('CountryName', axis=0, ascending=False, inplace=True)
print('Process Data Values =================================')
print(ProcessData)
print('=====================================================')
# Output Agreement ===========================================
OutputData=ProcessData
sOutputFileName='C:/VKHCG/05-DS/9999-Data/HORUS-CSV-Country.csv'
OutputData.to_csv(sOutputFileName, index=False)
print('CSV to HORUS - Done')
# Utility done ===============================================
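
As a quick sanity check (not part of the utility itself), the HORUS file can be read back to confirm the output agreement. Note that because to_csv was called with index=False after CountryNumber became the index, only CountryName is delivered:

# Read the HORUS output back to confirm the delivered columns
import pandas as pd

HorusData = pd.read_csv('C:/VKHCG/05-DS/9999-Data/HORUS-CSV-Country.csv')
print(HorusData.columns.tolist())   # expected: ['CountryName']
print(HorusData.head())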

Output:

B. XML to HORUS format

Code:
# Utility Start XML to HORUS =================================
# Standard Tools
import pandas as pd
import xml.etree.ElementTree as ET

def df2xml(data):
    header = data.columns
    root = ET.Element('root')
    for row in range(data.shape[0]):
        entry = ET.SubElement(root, 'entry')
        for index in range(data.shape[1]):
            schild=str(header[index])
            child = ET.SubElement(entry, schild)
            if str(data[schild][row]) != 'nan':
                child.text = str(data[schild][row])
            else:
                child.text = 'n/a'
            entry.append(child)
    result = ET.tostring(root)
    return result

def xml2df(xml_data):
    root = ET.XML(xml_data)
    all_records = []
    for i, child in enumerate(root):
        record = {}
        for subchild in child:
            record[subchild.tag] = subchild.text
        all_records.append(record)
    return pd.DataFrame(all_records)

sInputFileName='C:/VKHCG/05-DS/9999-Data/Country_Code.xml'
InputData = open(sInputFileName).read()
print('=====================================================')
print('Input Data Values ===================================')
print('=====================================================')
print(InputData)
print('=====================================================')
#=============================================================
# Processing Rules ===========================================
#=============================================================
ProcessDataXML=InputData
# XML to Data Frame
ProcessData=xml2df(ProcessDataXML)
# Remove columns ISO-2-CODE and ISO-3-Code
ProcessData.drop('ISO-2-CODE', axis=1, inplace=True)
ProcessData.drop('ISO-3-Code', axis=1, inplace=True)
# Rename Country and ISO-M49
ProcessData.rename(columns={'Country': 'CountryName'}, inplace=True)
ProcessData.rename(columns={'ISO-M49': 'CountryNumber'}, inplace=True)
# Set new index
ProcessData.set_index('CountryNumber', inplace=True)
# Sort data by CountryName
ProcessData.sort_values('CountryName', axis=0, ascending=False, inplace=True)
print('=====================================================')
print('Process Data Values =================================')
print('=====================================================')
print(ProcessData)
print('=====================================================')
OutputData=ProcessData
sOutputFileName='C:/VKHCG/05-DS/9999-Data/HORUS-XML-Country.csv'
OutputData.to_csv(sOutputFileName, index=False)
print('=====================================================')
print('XML to HORUS - Done')
print('=====================================================')
# Utility done ===============================================

Output:

C. JSON to HORUS format

Code:
# Utility Start JSON to HORUS ================================
# Standard Tools
#=============================================================
import pandas as pd
# Input Agreement ============================================
sInputFileName='C:/VKHCG/05-DS/9999-Data/Country_Code.json'
InputData=pd.read_json(sInputFileName, orient='index', encoding="latin-1")
print('Input Data Values ===================================')
print(InputData)
print('=====================================================')
# Processing Rules ===========================================
ProcessData=InputData
# Remove columns ISO-2-CODE and ISO-3-Code
ProcessData.drop('ISO-2-CODE', axis=1, inplace=True)
ProcessData.drop('ISO-3-Code', axis=1, inplace=True)
# Rename Country and ISO-M49
ProcessData.rename(columns={'Country': 'CountryName'}, inplace=True)
ProcessData.rename(columns={'ISO-M49': 'CountryNumber'}, inplace=True)
# Set new index
ProcessData.set_index('CountryNumber', inplace=True)
# Sort data by CountryName
ProcessData.sort_values('CountryName', axis=0, ascending=False, inplace=True)
print('Process Data Values =================================')
print(ProcessData)
print('=====================================================')
# Output Agreement ===========================================
OutputData=ProcessData
sOutputFileName='C:/VKHCG/05-DS/9999-Data/HORUS-JSON-Country.csv'
OutputData.to_csv(sOutputFileName, index=False)
print('JSON to HORUS - Done')
# Utility done ===============================================

Output:

D. Database (SQLite) to HORUS format

Code:
# Utility Start Database to HORUS ============================
# Standard Tools
#=============================================================
import pandas as pd
import sqlite3 as sq
# Input Agreement ============================================
sInputFileName='C:/VKHCG/05-DS/9999-Data/utility.db'
sInputTable='Country_Code'
conn = sq.connect(sInputFileName)
sSQL='select * FROM ' + sInputTable + ';'
InputData=pd.read_sql_query(sSQL, conn)
print('Input Data Values ===================================')
print(InputData)
print('=====================================================')
# Processing Rules ===========================================
ProcessData=InputData
# Remove columns ISO-2-CODE and ISO-3-Code
ProcessData.drop('ISO-2-CODE', axis=1, inplace=True)
ProcessData.drop('ISO-3-Code', axis=1, inplace=True)
# Rename Country and ISO-M49
ProcessData.rename(columns={'Country': 'CountryName'}, inplace=True)
ProcessData.rename(columns={'ISO-M49': 'CountryNumber'}, inplace=True)
# Set new index
ProcessData.set_index('CountryNumber', inplace=True)
# Sort data by CountryName
ProcessData.sort_values('CountryName', axis=0, ascending=False, inplace=True)
print('Process Data Values =================================')
print(ProcessData)
print('=====================================================')
# Output Agreement ===========================================
OutputData=ProcessData
sOutputFileName='C:/VKHCG/05-DS/9999-Data/HORUS-CSV-Country.csv'
OutputData.to_csv(sOutputFileName, index=False)
print('Database to HORUS - Done')
# Utility done ===============================================
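
The utility above assumes utility.db already contains a Country_Code table. If it does not exist yet, a hypothetical one-off loader sketch (reusing the Country_Code.csv from part A) could create it:

# One-off sketch to build utility.db from Country_Code.csv
# (assumption: the CSV from part A is available and unmodified)
import pandas as pd
import sqlite3 as sq

CsvData = pd.read_csv('C:/VKHCG/05-DS/9999-Data/Country_Code.csv', encoding="latin-1")
conn = sq.connect('C:/VKHCG/05-DS/9999-Data/utility.db')
CsvData.to_sql('Country_Code', conn, if_exists='replace', index=False)
conn.close()
print('utility.db ready')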

Output:

E. Picture (JPEG) to HORUS format (Use SPYDER to run this program)

Download pillow and scipy:

pip install pillow
pip install scipy==1.2.1

Code:

# Utility Start Picture to HORUS =============================
# Standard Tools
#=============================================================
from scipy.misc import imread
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
# Input Agreement ============================================
sInputFileName='C:/VKHCG/05-DS/9999-Data/Angus.jpg'
InputData = imread(sInputFileName, flatten=False, mode='RGBA')
print('Input Data Values ===================================')
print('X: ', InputData.shape[0])
print('Y: ', InputData.shape[1])
print('RGBA: ', InputData.shape[2])
print('=====================================================')
# Processing Rules ===========================================
ProcessRawData=InputData.flatten()
y=InputData.shape[2] + 2
x=int(ProcessRawData.shape[0]/y)
ProcessData=pd.DataFrame(np.reshape(ProcessRawData, (x, y)))
sColumns= ['XAxis','YAxis','Red','Green','Blue','Alpha']
ProcessData.columns=sColumns
ProcessData.index.names =['ID']
print('Rows: ', ProcessData.shape[0])
print('Columns :', ProcessData.shape[1])
print('=====================================================')
print('Process Data Values =================================')
print('=====================================================')
plt.imshow(InputData)
plt.show()
print('=====================================================')
# Output Agreement ===========================================
OutputData=ProcessData
print('Storing File')
sOutputFileName='C:/VKHCG/05-DS/9999-Data/HORUS-Picture.csv'
OutputData.to_csv(sOutputFileName, index=False)
print('=====================================================')
print('Picture to HORUS - Done')
print('=====================================================')

Output:

F. Video to HORUS format

pip3 install opencv-python

Code:
Movie to Frames
# Utility Start Movie to HORUS (Part 1) ======================
# Standard Tools
#=============================================================
import os
import shutil
import cv2
#=============================================================
sInputFileName='C:/VKHCG/05-DS/9999-Data/dog.mp4'
sDataBaseDir='C:/VKHCG/05-DS/9999-Data/temp'
if os.path.exists(sDataBaseDir):
    shutil.rmtree(sDataBaseDir)
if not os.path.exists(sDataBaseDir):
    os.makedirs(sDataBaseDir)
print('=====================================================')
print('Start Movie to Frames')
print('=====================================================')
vidcap = cv2.VideoCapture(sInputFileName)
success, image = vidcap.read()
count = 0
while success:
    sFrame=sDataBaseDir + str('/dog-frame-' + str(format(count, '04d')) + '.jpg')
    print('Extracted: ', sFrame)
    cv2.imwrite(sFrame, image)
    if os.path.getsize(sFrame) == 0:
        count += -1
        os.remove(sFrame)
        print('Removed: ', sFrame)
    if cv2.waitKey(10) == 27:  # exit if Escape is hit
        break
    count += 1
    success, image = vidcap.read()  # read the next frame
print('=====================================================')
print('Generated : ', count, ' Frames')
print('=====================================================')
print('Movie to Frames HORUS - Done')
print('=====================================================')
# Utility done =========================================
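
As an optional sanity check (not part of the original utility), OpenCV can report the expected frame count before extraction, so the generated-frame total can be validated:

# Optional check: ask OpenCV for the frame count of the clip
import cv2

vidcap = cv2.VideoCapture('C:/VKHCG/05-DS/9999-Data/dog.mp4')
print('Expected frames:', int(vidcap.get(cv2.CAP_PROP_FRAME_COUNT)))
vidcap.release()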

Output:

Now that the frames are created, they need to be loaded into HORUS.
Frames to HORUS (Use SPYDER to run this program)
# Utility Start Movie to HORUS (Part 2) ======================
# Standard Tools
#=============================================================
from scipy.misc import imread
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
import os
# Input Agreement ============================================
sDataBaseDir='C:/VKHCG/05-DS/9999-Data/temp'
f=0
for file in os.listdir(sDataBaseDir):
    if file.endswith(".jpg"):
        f += 1
        sInputFileName=os.path.join(sDataBaseDir, file)
        print('Process : ', sInputFileName)
        InputData = imread(sInputFileName, flatten=False, mode='RGBA')
        print('Input Data Values ===================================')
        print('X: ', InputData.shape[0])
        print('Y: ', InputData.shape[1])
        print('RGBA: ', InputData.shape[2])
        print('=====================================================')
        # Processing Rules ===========================================
        ProcessRawData=InputData.flatten()
        y=InputData.shape[2] + 2
        x=int(ProcessRawData.shape[0]/y)
        ProcessFrameData=pd.DataFrame(np.reshape(ProcessRawData, (x, y)))
        ProcessFrameData['Frame']=file
        print('=====================================================')
        print('Process Data Values =================================')
        print('=====================================================')
        plt.imshow(InputData)
        plt.show()
        if f == 1:
            ProcessData=ProcessFrameData
        else:
            ProcessData=ProcessData.append(ProcessFrameData)
if f > 0:
    sColumns= ['XAxis','YAxis','Red','Green','Blue','Alpha','FrameName']
    ProcessData.columns=sColumns
    print('=====================================================')
    ProcessFrameData.index.names =['ID']
    print('Rows: ', ProcessData.shape[0])
    print('Columns :', ProcessData.shape[1])
    print('=====================================================')
    # Output Agreement ===========================================
    OutputData=ProcessData
    print('Storing File')
    sOutputFileName='C:/VKHCG/05-DS/9999-Data/HORUS-Movie-Frame.csv'
    OutputData.to_csv(sOutputFileName, index=False)
    print('=====================================================')
    print('Processed : ', f, ' frames')
    print('=====================================================')
print('Movie to HORUS - Done')
print('=====================================================')

Output:

dog-frame-0000.jpg    dog-frame-0001.jpg

dog-frame-0100.jpg    dog-frame-0101.jpg

Check the files in C:\VKHCG\05-DS\9999-Data\temp.
The movie clip is converted into 102 picture frames and then to HORUS format.

Practical 3: Utilities and Auditing

A. Fixers Utilities
Fixers enable your solution to take your existing data and fix a specific quality issue.
#---------------------------- Program to Demonstrate Fixers utilities -------------------
import string
import datetime as dt
# 1 Removing leading or lagging spaces from a data entry
print('#1 Removing leading or lagging spaces from a data entry')
baddata = " Data Science with too many spaces is bad!!! "
print('>', baddata, '<')
cleandata=baddata.strip()
print('>', cleandata, '<')
# 2 Removing nonprintable characters from a data entry
print('#2 Removing nonprintable characters from a data entry')
printable = set(string.printable)
baddata = "Data\x00Science with\x02 funny characters is \x10bad!!!"
cleandata=''.join(filter(lambda x: x in printable, baddata))
print('Bad Data : ', baddata)
print('Clean Data : ', cleandata)
# 3 Reformatting a data entry to match specific formatting criteria
# Convert YYYY-MM-DD to DD Month YYYY
print('# 3 Reformatting data entry to match specific formatting criteria.')
baddate = dt.date(2019, 10, 31)
baddata=format(baddate, '%Y-%m-%d')
gooddate = dt.datetime.strptime(baddata, '%Y-%m-%d')
gooddata=format(gooddate, '%d %B %Y')
print('Bad Data : ', baddata)
print('Good Data : ', gooddata)

Output:

B. Data Binning or Bucketing
Binning is a data preprocessing technique used to reduce the effects of minor observation errors. Statistical data binning is a way to group a number of more or less continuous values into a smaller number of "bins."

Code:

import numpy as np
import matplotlib.pyplot as plt
import scipy.stats as stats
np.random.seed(0)
# example data
mu = 90     # mean of distribution
sigma = 25  # standard deviation of distribution
x = mu + sigma * np.random.randn(5000)
num_bins = 25
fig, ax = plt.subplots()
# the histogram of the data
n, bins, patches = ax.hist(x, num_bins, density=1)
# add a 'best fit' line (stats.norm.pdf replaces the deprecated mlab.normpdf)
y = stats.norm.pdf(bins, mu, sigma)
ax.plot(bins, y, '--')
ax.set_xlabel('Example Data')
ax.set_ylabel('Probability density')
sTitle=r'Histogram ' + str(len(x)) + ' entries into ' + str(num_bins) + ' Bins: $\mu=' + str(mu) + '$, $\sigma=' + str(sigma) + '$'
ax.set_title(sTitle)
fig.tight_layout()
sPathFig='C:/VKHCG/05-DS/4000-UL/0200-DU/DU-Histogram.png'
fig.savefig(sPathFig)
plt.show()
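
The histogram above only visualizes the bins. To actually assign each value to a bin, which is the grouping described in the definition, a minimal sketch reusing the x and bins variables from the code above (np.digitize is standard NumPy, not part of the original listing):

# Assign every value in x to one of the bins computed by ax.hist above
bin_index = np.digitize(x, bins)   # bin number for each of the 5000 values
counts = np.bincount(bin_index)    # volume of each bin
print(counts)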

Output:

C. Averaging of Data

The use of averaged feature values enables the reduction of data volumes in a controlled fashion, to improve effective data processing.

C:\VKHCG\05-DS\4000-UL\0200-DU\DU-Mean.py

Code:
import pandas as pd
################################################################
InputFileName='IP_DATA_CORE.csv'
OutputFileName='Retrieve_Router_Location.csv'
Base='C:/VKHCG'
print('################################')
print('Working Base :', Base)
print('################################')
sFileName=Base + '/01-Vermeulen/00-RawData/' + InputFileName
print('Loading :', sFileName)
IP_DATA_ALL=pd.read_csv(sFileName, header=0, low_memory=False,
                        usecols=['Country','Place Name','Latitude','Longitude'], encoding="latin-1")
IP_DATA_ALL.rename(columns={'Place Name': 'Place_Name'}, inplace=True)
AllData=IP_DATA_ALL[['Country', 'Place_Name','Latitude']]
print(AllData)
# Average (mean) latitude per Country and Place_Name
MeanData=AllData.groupby(['Country', 'Place_Name'])['Latitude'].mean()
print(MeanData)

Output:

644
645D. Outlier Detection
646Outliers are data that is so different from the rest of the data in the data set that it may be caused by an error in
647the data source. There is a technique called outlier detection that, with good data science, will identify these
648outliers.
649C:\VKHCG\05-DS\4000-UL\0200-DU\DU-Outliers.py
650Code:
651################################################################
652# -*- coding: utf-8 -*-
653################################################################
654import pandas as pd
655################################################################
656InputFileName='IP_DATA_CORE.csv'
657OutputFileName='Retrieve_Router_Location.csv'
658Base='C:/VKHCG'
659print('################################')
660print('Working Base :',Base)
661print('################################')
662################################################################
663sFileName=Base + '/01-Vermeulen/00-RawData/' + InputFileName
664print('Loading :',sFileName)
665IP_DATA_ALL=pd.read_csv(sFileName,header=0,low_memory=False,
666usecols=['Country','Place Name','Latitude','Longitude'], encoding="latin-1")
667IP_DATA_ALL.rename(columns={'Place Name': 'Place_Name'}, inplace=True)
668LondonData=IP_DATA_ALL.loc[IP_DATA_ALL['Place_Name']=='London']
669AllData=LondonData[['Country', 'Place_Name','Latitude']]
670print('All Data')
671print(AllData)
672MeanData=AllData.groupby(['Country', 'Place_Name'])['Latitude'].mean()
673StdData=AllData.groupby(['Country', 'Place_Name'])['Latitude'].std()
674print('Outliers')
675UpperBound=float(MeanData+StdData)
676print('Higher than ', UpperBound)
677OutliersHigher=AllData[AllData.Latitude>UpperBound]
678print(OutliersHigher)
679LowerBound=float(MeanData-StdData)
680print('Lower than ', LowerBound)
681OutliersLower=AllData[AllData.Latitude<LowerBound]
682print(OutliersLower)
683print('Not Outliers')
684OutliersNot=AllData[(AllData.Latitude>=LowerBound) & (AllData.Latitude<=UpperBound)]
685print(OutliersNot)
686################################################################
687
688
689
690
691
692
693
694
695Output:
696=========== RESTART: C:\VKHCG\05-DS\4000-UL\0200-DU\DU-Outliers.py ===========
697################################
698Working Base : C:/VKHCG
699################################
700Loading : C:/VKHCG/01-Vermeulen/00-RawData/IP_DATA_CORE.csv
701All Data
702Country Place_Name Latitude
7031910 GB London 51.5130
7041911 GB London 51.5508
7051912 GB London 51.5649
7061913 GB London 51.5895
7071914 GB London 51.5232
708... ... ... ...
7093434 GB London 51.5092
7103435 GB London 51.5092
7113436 GB London 51.5163
7123437 GB London 51.5085
7133438 GB London 51.5136
714[1502 rows x 3 columns]
715Outliers
716Higher than 51.51263550786781
717Country Place_Name Latitude
7181910 GB London 51.5130
7191911 GB London 51.5508
7201912 GB London 51.5649
7211913 GB London 51.5895
7221914 GB London 51.5232
7231916 GB London 51.5491
7241919 GB London 51.5161
7251920 GB London 51.5198
7261921 GB London 51.5198
7271923 GB London 51.5237
7281924 GB London 51.5237
7291925 GB London 51.5237
7301926 GB London 51.5237
7311927 GB London 51.5232
7323436 GB London 51.5163
7333438 GB London 51.5136
734Lower than 51.50617687562166
735Country Place_Name Latitude
7361915 GB London 51.4739
737Not Outliers
738
739

E. Audit
The audit, balance, and control layer is the area from which you can observe what is currently running within your data science environment. It records:
• Process-execution statistics
• Balancing and controls
• Rejects and error-handling
• Codes management
An audit is a systematic and independent examination of the ecosystem.
The audit sublayer records the processes that are running at any specific point within the environment. This information is used by data scientists and engineers to understand and plan future improvements to the processing.

F. Logging
Write a Python / R program for basic logging in data science.
C:\VKHCG\77-Yoke\Yoke_Logging.py

Code:
import sys
import os
import logging
import uuid
import shutil
import time
############################################################
Base='C:/VKHCG'
############################################################
sCompanies=['01-Vermeulen','02-Krennwallner','03-Hillman','04-Clark']
sLayers=['01-Retrieve','02-Assess','03-Process','04-Transform','05-Organise','06-Report']
sLevels=['debug','info','warning','error']
for sCompany in sCompanies:
    sFileDir=Base + '/' + sCompany
    if not os.path.exists(sFileDir):
        os.makedirs(sFileDir)
    for sLayer in sLayers:
        log = logging.getLogger()  # root logger
        for hdlr in log.handlers[:]:  # remove all old handlers
            log.removeHandler(hdlr)
        #----------------------------------------------------------------------------------
        sFileDir=Base + '/' + sCompany + '/' + sLayer + '/Logging'
        if os.path.exists(sFileDir):
            shutil.rmtree(sFileDir)
        time.sleep(2)
        if not os.path.exists(sFileDir):
            os.makedirs(sFileDir)
        skey=str(uuid.uuid4())
        sLogFile=Base + '/' + sCompany + '/' + sLayer + '/Logging/Logging_' + skey + '.log'
        print('Set up:', sLogFile)
        # set up logging to file
        logging.basicConfig(level=logging.DEBUG,
                            format='%(asctime)s %(name)-12s %(levelname)-8s %(message)s',
                            datefmt='%m-%d %H:%M',
                            filename=sLogFile,
                            filemode='w')
        # define a Handler which writes INFO messages or higher to sys.stderr
        console = logging.StreamHandler()
        console.setLevel(logging.INFO)
        # set a format which is simpler for console use
        formatter = logging.Formatter('%(name)-12s: %(levelname)-8s %(message)s')
        # tell the handler to use this format
        console.setFormatter(formatter)
        # add the handler to the root logger
        logging.getLogger('').addHandler(console)
        # Now, we can log to the root logger, or any other logger. First the root...
        logging.info('Practical Data Science is fun!.')
        for sLevel in sLevels:
            sApp='Application-' + sCompany + '-' + sLayer + '-' + sLevel
            logger = logging.getLogger(sApp)
            if sLevel == 'debug':
                logger.debug('Practical Data Science logged a debugging message.')
            if sLevel == 'info':
                logger.info('Practical Data Science logged information message.')
            if sLevel == 'warning':
                logger.warning('Practical Data Science logged a warning message.')
            if sLevel == 'error':
                logger.error('Practical Data Science logged an error message.')
#------------------------------------------------------------------------------
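
To confirm the run, a small sketch (an assumption, not part of Yoke_Logging.py) can count the log files created; 4 companies x 6 layers should yield 24 logs:

# Count the Logging_*.log files produced by the loops above
import os

Base='C:/VKHCG'
nLogs=0
for sRoot, sDirs, sFiles in os.walk(Base):
    for sFile in sFiles:
        if sFile.startswith('Logging_') and sFile.endswith('.log'):
            nLogs += 1
print('Log files found:', nLogs)   # expected: 24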

Output:

Practical 4

Aim: Retrieve Superstep

A. Perform the following data processing using R.

Use R-Studio for the following:

> library(readr)

Warning message: package 'readr' was built under R version 3.4.4

Load a table named IP_DATA_ALL.csv.

> IP_DATA_ALL <- read_csv("C:/VKHCG/01-Vermeulen/00-RawData/IP_DATA_ALL.csv")

Parsed with column specification:
cols(
  ID = col_double(),
  Country = col_character(),
  `Place Name` = col_character(),
  `Post Code` = col_double(),
  Latitude = col_double(),
  Longitude = col_double(),
  `First IP Number` = col_double(),
  `Last IP Number` = col_double()
)

> View(IP_DATA_ALL)
> spec(IP_DATA_ALL)

cols(
  ID = col_double(),
  Country = col_character(),
  `Place Name` = col_character(),
  `Post Code` = col_double(),
  Latitude = col_double(),
  Longitude = col_double(),
  `First IP Number` = col_double(),
  `Last IP Number` = col_double()
)

This informs you that you have the following eight columns:
• ID of type integer
• Country of type character
• Place name of type character
• Post code of type character
• Latitude of type numeric double
• Longitude of type numeric double
• First IP number of type integer
• Last IP number of type integer

> library(tibble)
> set_tidy_names(IP_DATA_ALL, syntactic = TRUE, quiet = FALSE)

New names:
Place Name -> Place.Name
Post Code -> Post.Code
First IP Number -> First.IP.Number
Last IP Number -> Last.IP.Number

This informs you that four of the field names are not valid and suggests new field names that are valid.
You can fix any detected invalid column names by executing

> IP_DATA_ALL_FIX=set_tidy_names(IP_DATA_ALL, syntactic = TRUE, quiet = TRUE)

By using the command View(IP_DATA_ALL_FIX), you can check that you have fixed the columns.

The new table IP_DATA_ALL_FIX has the invalid column names replaced with valid names.

> sapply(IP_DATA_ALL_FIX, typeof)

         ID     Country  Place.Name   Post.Code    Latitude
   "double" "character" "character"    "double"    "double"
  Longitude First.IP.Number Last.IP.Number
   "double"        "double"       "double"

> library(data.table)
> hist_country=data.table(Country=unique(IP_DATA_ALL_FIX[is.na(IP_DATA_ALL_FIX['Country']) == 0, ]$Country))

> setorder(hist_country, 'Country')
> hist_country_with_id=rowid_to_column(hist_country, var = "RowIDCountry")
> View(hist_country_with_id)
> IP_DATA_COUNTRY_FREQ=data.table(with(IP_DATA_ALL_FIX, table(Country)))
> View(IP_DATA_COUNTRY_FREQ)

• The two biggest subset volumes are from the US and GB.
• The US has just over four times the data of GB.

> hist_latitude=data.table(Latitude=unique(IP_DATA_ALL_FIX[is.na(IP_DATA_ALL_FIX['Latitude']) == 0, ]$Latitude))
> setkeyv(hist_latitude, 'Latitude')
> setorder(hist_latitude)
> hist_latitude_with_id=rowid_to_column(hist_latitude, var = "RowID")
> View(hist_latitude_with_id)
> IP_DATA_Latitude_FREQ=data.table(with(IP_DATA_ALL_FIX, table(Latitude)))
> View(IP_DATA_Latitude_FREQ)

• The two biggest data volumes are from latitudes 51.5092 and 40.6888.
• The spread appears to be nearly equal between the top-two latitudes.

> sapply(IP_DATA_ALL_FIX[,'Latitude'], min, na.rm=TRUE)

Latitude
 40.6888

What does this tell you?

Fact: The range of latitude for the Northern Hemisphere is from 0 to 90. So, if you do not have any latitudes farther south than 40.6888, you can improve your retrieve routine.

> sapply(IP_DATA_ALL_FIX[,'Country'], min, na.rm=TRUE)

Country
   "DE"
The alphabetically smallest country code is DE – Germany.

> sapply(IP_DATA_ALL_FIX[,'Latitude'], max, na.rm=TRUE)

Latitude
 51.5895

The result is 51.5895. What does this tell you?
Fact: The range of latitude for the Northern Hemisphere is from 0 to 90. So, if you do not have any latitudes more northerly than 51.5895, you can improve your retrieve routine.

> sapply(IP_DATA_ALL_FIX[,'Country'], max, na.rm=TRUE)

Country
   "US"

> sapply(IP_DATA_ALL_FIX[,'Latitude'], mean, na.rm=TRUE)

Latitude
46.69097

> sapply(IP_DATA_ALL_FIX[,'Latitude'], median, na.rm=TRUE)

Latitude
   48.15

> sapply(IP_DATA_ALL_FIX[,'Latitude'], range, na.rm=TRUE)

     Latitude
[1,]  40.6888
[2,]  51.5895

> sapply(IP_DATA_ALL_FIX[,'Latitude'], quantile, na.rm=TRUE)

     Latitude
0%    40.6888
25%   40.7588
50%   48.1500
75%   51.5092
100%  51.5895

> sapply(IP_DATA_ALL_FIX[,'Latitude'], sd, na.rm=TRUE)

Latitude
4.890387

> sapply(IP_DATA_ALL_FIX[,'Longitude'], sd, na.rm=TRUE)

Longitude
 38.01702

B. Program to retrieve different attributes of data.

##### C:\VKHCG\01-Vermeulen\01-Retrieve\Retrieve_IP_DATA_ALL.py #####

import sys
import os
import pandas as pd
################################################################
Base='C:/VKHCG'
################################################################
sFileName=Base + '/01-Vermeulen/00-RawData/IP_DATA_ALL.csv'
print('Loading :', sFileName)
IP_DATA_ALL=pd.read_csv(sFileName, header=0, low_memory=False, encoding="latin-1")
################################################################
sFileDir=Base + '/01-Vermeulen/01-Retrieve/01-EDS/02-Python'
if not os.path.exists(sFileDir):
    os.makedirs(sFileDir)
print('Rows:', IP_DATA_ALL.shape[0])
print('Columns:', IP_DATA_ALL.shape[1])
print('### Raw Data Set #####################################')
for i in range(0, len(IP_DATA_ALL.columns)):
    print(IP_DATA_ALL.columns[i], type(IP_DATA_ALL.columns[i]))
print('### Fixed Data Set ###################################')
IP_DATA_ALL_FIX=IP_DATA_ALL
for i in range(0, len(IP_DATA_ALL.columns)):
    cNameOld=IP_DATA_ALL_FIX.columns[i] + ' '
    cNameNew=cNameOld.strip().replace(" ", ".")
    IP_DATA_ALL_FIX.columns.values[i] = cNameNew
    print(IP_DATA_ALL.columns[i], type(IP_DATA_ALL.columns[i]))
################################################################
#print(IP_DATA_ALL_FIX.head())
################################################################
print('Fixed Data Set with ID')
IP_DATA_ALL_with_ID=IP_DATA_ALL_FIX
IP_DATA_ALL_with_ID.index.names = ['RowID']
#print(IP_DATA_ALL_with_ID.head())
sFileName2=sFileDir + '/Retrieve_IP_DATA.csv'
IP_DATA_ALL_with_ID.to_csv(sFileName2, index=True, encoding="latin-1")
################################################################
print('### Done!! ############################################')

Output:

C. Data Pattern

To determine a pattern for the data values, replace every letter with an uppercase A, every number with an uppercase N, every space with a lowercase b, and every other unknown character with a lowercase u.
As a result, "Good Book 101" becomes "AAAAbAAAAbNNN."
This pattern creation is beneficial for designing any specific assess rules. This pattern view of data is a quick way to identify common patterns or determine standard layouts.

library(readr)
library(data.table)
FileName=paste0('c:/VKHCG/01-Vermeulen/00-RawData/IP_DATA_ALL.csv')
IP_DATA_ALL <- read_csv(FileName)
hist_country=data.table(Country=unique(IP_DATA_ALL$Country))
pattern_country=data.table(Country=hist_country$Country, PatternCountry=hist_country$Country)

oldchar=c(letters, LETTERS)
newchar=replicate(length(oldchar), "A")
for (r in seq(nrow(pattern_country))) {
  s=pattern_country[r,]$PatternCountry
  for (c in seq(length(oldchar))) {
    s=chartr(oldchar[c], newchar[c], s)
  }
  for (n in seq(0, 9, 1)) {
    s=chartr(as.character(n), "N", s)
  }
  s=chartr(" ", "b", s)
  s=chartr(".", "u", s)
  pattern_country[r,]$PatternCountry=s
}
View(pattern_country)
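
Since the practical accepts Python or R, here is a minimal Python sketch of the same pattern rule (letters become A, digits become N, spaces become b, anything else becomes u); the function name data_pattern is hypothetical:

# Hypothetical Python equivalent of the R pattern logic above
def data_pattern(value):
    out = []
    for ch in str(value):
        if ch.isalpha():
            out.append('A')    # any letter -> A
        elif ch.isdigit():
            out.append('N')    # any digit -> N
        elif ch == ' ':
            out.append('b')    # space -> b
        else:
            out.append('u')    # anything else -> u
    return ''.join(out)

print(data_pattern('Good Book 101'))   # AAAAbAAAAbNNN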

Example 2:

This is a common use of patterns to separate common standards and structures. Patterns can be routed to separate retrieve procedures. If the two patterns NNNNuNNuNN and uuNNuNNuNN are found, you can send NNNNuNNuNN directly to be converted into a date, while uuNNuNNuNN goes through a quality-improvement process and then routes back to the same queue as NNNNuNNuNN once it complies.

library(readr)
library(data.table)
Base='C:/VKHCG'
FileName=paste0(Base, '/01-Vermeulen/00-RawData/IP_DATA_ALL.csv')
IP_DATA_ALL <- read_csv(FileName)
hist_latitude=data.table(Latitude=unique(IP_DATA_ALL$Latitude))
pattern_latitude=data.table(latitude=hist_latitude$Latitude,
                            Patternlatitude=as.character(hist_latitude$Latitude))
oldchar=c(letters, LETTERS)
newchar=replicate(length(oldchar), "A")
for (r in seq(nrow(pattern_latitude))) {
  s=pattern_latitude[r,]$Patternlatitude
  for (c in seq(length(oldchar))) {
    s=chartr(oldchar[c], newchar[c], s)
  }
  for (n in seq(0, 9, 1)) {
    s=chartr(as.character(n), "N", s)
  }
  s=chartr(" ", "b", s)
  s=chartr("+", "u", s)
  s=chartr("-", "u", s)
  s=chartr(".", "u", s)
  pattern_latitude[r,]$Patternlatitude=s
}
setorder(pattern_latitude, latitude)
View(pattern_latitude[1:3])

D. Loading IP_DATA_ALL:

This data set contains all the IP address allocations in the world. It will help you to locate your customers when interacting with them online.

Create a new Python script file and save it as Retrieve-IP_DATA_ALL.py in directory C:\VKHCG\01-Vermeulen\01-Retrieve.

##############Retrieve-IP_DATA_ALL.py########################
# -*- coding: utf-8 -*-
################################################################
import sys
import os
import pandas as pd
################################################################
Base='C:/VKHCG'
################################################################
sFileName=Base + '/01-Vermeulen/00-RawData/IP_DATA_ALL.csv'
print('Loading :', sFileName)
IP_DATA_ALL=pd.read_csv(sFileName, header=0, low_memory=False, encoding="latin-1")
################################################################
sFileDir=Base + '/01-Vermeulen/01-Retrieve/01-EDS/02-Python'
if not os.path.exists(sFileDir):
    os.makedirs(sFileDir)
print('Rows:', IP_DATA_ALL.shape[0])
print('Columns:', IP_DATA_ALL.shape[1])
print('### Raw Data Set #####################################')
for i in range(0, len(IP_DATA_ALL.columns)):
    print(IP_DATA_ALL.columns[i], type(IP_DATA_ALL.columns[i]))
print('### Fixed Data Set ###################################')
IP_DATA_ALL_FIX=IP_DATA_ALL
for i in range(0, len(IP_DATA_ALL.columns)):
    cNameOld=IP_DATA_ALL_FIX.columns[i] + ' '
    cNameNew=cNameOld.strip().replace(" ", ".")
    IP_DATA_ALL_FIX.columns.values[i] = cNameNew
    print(IP_DATA_ALL.columns[i], type(IP_DATA_ALL.columns[i]))
################################################################
#print(IP_DATA_ALL_FIX.head())
################################################################
print('Fixed Data Set with ID')
IP_DATA_ALL_with_ID=IP_DATA_ALL_FIX
IP_DATA_ALL_with_ID.index.names = ['RowID']
#print(IP_DATA_ALL_with_ID.head())
sFileName2=sFileDir + '/Retrieve_IP_DATA.csv'
IP_DATA_ALL_with_ID.to_csv(sFileName2, index=True, encoding="latin-1")
################################################################
print('### Done!! ############################################')
Output:

Similarly, execute the code for:

Loading IP_DATA_C_VKHCG
Loading IP_DATA_CORE
Loading COUNTRY-CODES
Loading DE_Billboard_Locations
Loading GB_Postcode_Full
Loading GB_Postcode_Warehouse
Loading GB_Postcode_Shops
Loading Euro_ExchangeRates
Load: Profit_And_Loss

Assisting a company with its processing. The means are as follows:
• Identify the data sources required.
• Identify the source data format (CSV, XML, JSON, or database).
• Profile the data distribution (skew, histogram, min, max).
• Identify any loading characteristics (column names, data types, volumes).
• Determine the delivery format (CSV, XML, JSON, or database).

Vermeulen PLC

The company has two main jobs on which to focus your attention:
• Designing a routing diagram for the company
• Planning a schedule of jobs to be performed for the router network

A. Designing a routing diagram for the company
Start your Python editor and create a text file named Retrieve-IP_Routing.py in directory C:\VKHCG\01-Vermeulen\01-Retrieve.
################################################################
# -*- coding: utf-8 -*-
################################################################
import sys
import os
import pandas as pd
from math import radians, cos, sin, asin, sqrt
################################################################
def haversine(lon1, lat1, lon2, lat2, stype):
    """
    Calculate the great circle distance between two points
    on the earth (specified in decimal degrees)
    """
    # convert decimal degrees to radians
    lon1, lat1, lon2, lat2 = map(radians, [lon1, lat1, lon2, lat2])
    # haversine formula
    dlon = lon2 - lon1
    dlat = lat2 - lat1
    a = sin(dlat/2)**2 + cos(lat1) * cos(lat2) * sin(dlon/2)**2
    c = 2 * asin(sqrt(a))
    if stype == 'km':
        r = 6371  # Radius of earth in kilometers
    else:
        r = 3956  # Radius of earth in miles
    d = round(c * r, 3)
    return d
################################################################
Base='C:/VKHCG'
################################################################
sFileName=Base + '/01-Vermeulen/00-RawData/IP_DATA_CORE.csv'
print('Loading :', sFileName)
IP_DATA_ALL=pd.read_csv(sFileName, header=0, low_memory=False,
                        usecols=['Country','Place Name','Latitude','Longitude'], encoding="latin-1")
################################################################
sFileDir=Base + '/01-Vermeulen/01-Retrieve/01-EDS/02-Python'
if not os.path.exists(sFileDir):
    os.makedirs(sFileDir)
################################################################
IP_DATA = IP_DATA_ALL.drop_duplicates(subset=None, keep='first', inplace=False)
IP_DATA.rename(columns={'Place Name': 'Place_Name'}, inplace=True)
IP_DATA1 = IP_DATA
# add a constant key K to build a cross join (every router paired with every other)
IP_DATA1.insert(0, 'K', 1)
IP_DATA2 = IP_DATA1
################################################################
print(IP_DATA1.shape)

IP_CROSS=pd.merge(right=IP_DATA1, left=IP_DATA2, on='K')
IP_CROSS.drop('K', axis=1, inplace=True)
IP_CROSS.rename(columns={'Longitude_x': 'Longitude_from', 'Longitude_y': 'Longitude_to'}, inplace=True)
IP_CROSS.rename(columns={'Latitude_x': 'Latitude_from', 'Latitude_y': 'Latitude_to'}, inplace=True)
IP_CROSS.rename(columns={'Place_Name_x': 'Place_Name_from', 'Place_Name_y': 'Place_Name_to'}, inplace=True)
IP_CROSS.rename(columns={'Country_x': 'Country_from', 'Country_y': 'Country_to'}, inplace=True)
################################################################
IP_CROSS['DistanceBetweenKilometers'] = IP_CROSS.apply(lambda row:
    haversine(
        row['Longitude_from'],
        row['Latitude_from'],
        row['Longitude_to'],
        row['Latitude_to'],
        'km')
    , axis=1)
################################################################
IP_CROSS['DistanceBetweenMiles'] = IP_CROSS.apply(lambda row:
    haversine(
        row['Longitude_from'],
        row['Latitude_from'],
        row['Longitude_to'],
        row['Latitude_to'],
        'miles')
    , axis=1)
print(IP_CROSS.shape)
sFileName2=sFileDir + '/Retrieve_IP_Routing.csv'
IP_CROSS.to_csv(sFileName2, index=False, encoding="latin-1")
################################################################
print('### Done!! ############################################')
################################################################
Output:
See the file named Retrieve_IP_Routing.csv in C:\VKHCG\01-Vermeulen\01-Retrieve\01-EDS\02-Python.

Total Records: 22501
So, the distance between a router in New York (40.7528, -73.9725) and another router in New York (40.7214, -74.0052) is 4.448 kilometers, or 2.762 miles.
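
A quick way to confirm that figure is to call the haversine() function defined in the script above directly (a sketch, run in the same Python session):

# Reproduce the New York router distances quoted above
print(haversine(-73.9725, 40.7528, -74.0052, 40.7214, 'km'))     # 4.448
print(haversine(-73.9725, 40.7528, -74.0052, 40.7214, 'miles'))  # 2.762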
B. Building a Diagram for the Scheduling of Jobs

Start your Python editor and create a text file named Retrieve-Router-Location.py in directory C:\VKHCG\01-Vermeulen\01-Retrieve.
######### Retrieve-Router-Location.py ###########
# -*- coding: utf-8 -*-
################################################################
import sys
import os
import pandas as pd
################################################################
InputFileName='IP_DATA_CORE.csv'
OutputFileName='Retrieve_Router_Location.csv'
################################################################
Base='C:/VKHCG'
################################################################
sFileName=Base + '/01-Vermeulen/00-RawData/' + InputFileName
print('Loading :', sFileName)
IP_DATA_ALL=pd.read_csv(sFileName, header=0, low_memory=False,
                        usecols=['Country','Place Name','Latitude','Longitude'], encoding="latin-1")
################################################################
IP_DATA_ALL.rename(columns={'Place Name': 'Place_Name'}, inplace=True)
################################################################
sFileDir=Base + '/01-Vermeulen/01-Retrieve/01-EDS/02-Python'
if not os.path.exists(sFileDir):
    os.makedirs(sFileDir)
# drop duplicate router locations, keeping the first occurrence
ROUTERLOC = IP_DATA_ALL.drop_duplicates(subset=None, keep='first', inplace=False)
print('Rows :', ROUTERLOC.shape[0])
print('Columns :', ROUTERLOC.shape[1])
sFileName2=sFileDir + '/' + OutputFileName
ROUTERLOC.to_csv(sFileName2, index=False, encoding="latin-1")
################################################################
print('### Done!! ############################################')
################################################################
Output:

See the file named Retrieve_Router_Location.csv in C:\VKHCG\01-Vermeulen\01-Retrieve\01-EDS\02-Python.

Krennwallner AG

The company has two main jobs in need of your attention:
• Picking content for billboards: I will guide you through the data science required to pick advertisements for each billboard in the company.
• Understanding your online visitor data: I will guide you through the evaluation of the web traffic to the billboards' online web servers.

A. Picking Content for Billboards

Start your Python editor and create a text file named Retrieve-DE-Billboard-Locations.py in directory C:\VKHCG\02-Krennwallner\01-Retrieve.
################# Retrieve-DE-Billboard-Locations.py ###############
# -*- coding: utf-8 -*-
################################################################
import sys
import os
import pandas as pd
################################################################
InputFileName='DE_Billboard_Locations.csv'
OutputFileName='Retrieve_DE_Billboard_Locations.csv'
Company='02-Krennwallner'
################################################################
Base='C:/VKHCG'
print('################################')
print('Working Base :', Base, ' using ', sys.platform)
print('################################')
################################################################
sFileName=Base + '/' + Company + '/00-RawData/' + InputFileName
print('Loading :', sFileName)
IP_DATA_ALL=pd.read_csv(sFileName, header=0, low_memory=False,
                        usecols=['Country','PlaceName','Latitude','Longitude'])
IP_DATA_ALL.rename(columns={'PlaceName': 'Place_Name'}, inplace=True)
################################################################
sFileDir=Base + '/' + Company + '/01-Retrieve/01-EDS/02-Python'
if not os.path.exists(sFileDir):
    os.makedirs(sFileDir)
ROUTERLOC = IP_DATA_ALL.drop_duplicates(subset=None, keep='first', inplace=False)
print('Rows :', ROUTERLOC.shape[0])
print('Columns :', ROUTERLOC.shape[1])
sFileName2=sFileDir + '/' + OutputFileName
ROUTERLOC.to_csv(sFileName2, index=False)
################################################################
print('### Done!! ############################################')

See the file named Retrieve_DE_Billboard_Locations.csv in C:\VKHCG\02-Krennwallner\01-Retrieve\01-EDS\02-Python.

B. Understanding Your Online Visitor Data

Let's retrieve the visitor data for the billboards we have in Germany.
It is often found that common and important information is buried somewhere in the company's various data sources. Investigating the upstream and downstream data sources of any direct suppliers or consumers attached to the specific business process is therefore necessary; that is part of the skills you are applying to data science. Numerous insightful fragments of information were found in the data sources surrounding a customer's business processes.

Start your Python editor and create a file named Retrieve-Online-Visitor.py in directory C:\VKHCG\02-Krennwallner\01-Retrieve.
################################################################
# -*- coding: utf-8 -*-
################################################################
import sys
import os
import pandas as pd
import gzip as gz
################################################################
InputFileName='IP_DATA_ALL.csv'
OutputFileName='Retrieve_Online_Visitor'
CompanyIn= '01-Vermeulen'
CompanyOut= '02-Krennwallner'
################################################################
Base='C:/VKHCG'
print('################################')
print('Working Base :', Base, ' using ', sys.platform)
print('################################')
################################################################
sFileName=Base + '/' + CompanyIn + '/00-RawData/' + InputFileName
print('Loading :', sFileName)
IP_DATA_ALL=pd.read_csv(sFileName, header=0, low_memory=False,
                        usecols=['Country','Place Name','Latitude','Longitude','First IP Number','Last IP Number'])
IP_DATA_ALL.rename(columns={'Place Name': 'Place_Name'}, inplace=True)
IP_DATA_ALL.rename(columns={'First IP Number': 'First_IP_Number'}, inplace=True)
IP_DATA_ALL.rename(columns={'Last IP Number': 'Last_IP_Number'}, inplace=True)
################################################################
sFileDir=Base + '/' + CompanyOut + '/01-Retrieve/01-EDS/02-Python'
if not os.path.exists(sFileDir):
    os.makedirs(sFileDir)
visitordata = IP_DATA_ALL.drop_duplicates(subset=None, keep='first', inplace=False)
visitordata10=visitordata.head(10)
print('Rows :', visitordata.shape[0])
print('Columns :', visitordata.shape[1])
print('Export CSV')
sFileName2=sFileDir + '/' + OutputFileName + '.csv'
visitordata.to_csv(sFileName2, index=False)
print('Store All:', sFileName2)
sFileName3=sFileDir + '/' + OutputFileName + '_10.csv'
visitordata10.to_csv(sFileName3, index=False)
print('Store 10:', sFileName3)
for z in ['gzip', 'bz2', 'xz']:
    if z == 'gzip':
        sFileName4=sFileName2 + '.gz'
    else:
        sFileName4=sFileName2 + '.' + z
    visitordata.to_csv(sFileName4, index=False, compression=z)
    print('Store :', sFileName4)
################################################################
print('Export JSON')
for sOrient in ['split','records','index','columns','values','table']:
    sFileName2=sFileDir + '/' + OutputFileName + '_' + sOrient + '.json'
    visitordata.to_json(sFileName2, orient=sOrient, force_ascii=True)
    print('Store All:', sFileName2)
    sFileName3=sFileDir + '/' + OutputFileName + '_10_' + sOrient + '.json'
    visitordata10.to_json(sFileName3, orient=sOrient, force_ascii=True)
    print('Store 10:', sFileName3)
    sFileName4=sFileName2 + '.gz'
    file_in = open(sFileName2, 'rb')
    file_out = gz.open(sFileName4, 'wb')
    file_out.writelines(file_in)
    file_in.close()
    file_out.close()
    print('Store GZIP All:', sFileName4)
    sFileName5=sFileDir + '/' + OutputFileName + '_' + sOrient + '_UnGZip.json'
    file_in = gz.open(sFileName4, 'rb')
    file_out = open(sFileName5, 'wb')
    file_out.writelines(file_in)
    file_in.close()
    file_out.close()
    print('Store UnGZIP All:', sFileName5)
################################################################
print('### Done!! ############################################')

Output:

See the file named Retrieve_Online_Visitor.csv in C:\VKHCG\02-Krennwallner\01-Retrieve\01-EDS\02-Python.

See the file named Retrieve_Online_Visitor_10.csv in C:\VKHCG\02-Krennwallner\01-Retrieve\01-EDS\02-Python.

Country,Place_Name,Latitude,Longitude,First_IP_Number,Last_IP_Number
BW,Gaborone,-24.6464,25.9119,692781056,692781567
BW,Gaborone,-24.6464,25.9119,692781824,692783103
BW,Gaborone,-24.6464,25.9119,692909056,692909311
BW,Gaborone,-24.6464,25.9119,692909568,692910079
BW,Gaborone,-24.6464,25.9119,693051392,693052415
BW,Gaborone,-24.6464,25.9119,693078272,693078527
BW,Gaborone,-24.6464,25.9119,693608448,693616639
BW,Gaborone,-24.6464,25.9119,696929792,696930047
BW,Gaborone,-24.6464,25.9119,700438784,700439039
BW,Gaborone,-24.6464,25.9119,702075904,702076927

You can also see the following JSON files of only ten records.

Output:

split : uses the format {index -> [index], columns -> [columns], data -> [values]}. See Retrieve_Online_Visitor_10_split.json.
{
"columns":
["Country","Place_Name","Latitude","Longitude","First_IP_Number","Last_IP_Number"],
"index":
[0,1],
"data":[
["BW","Gaborone",-24.6464,25.9119,692781056,692781567],
["BW","Gaborone",-24.6464,25.9119,692781824,692783103]]
}
records : uses the format [{column -> value}, . . ., {column -> value}]. See Retrieve_Online_Visitor_10_records.json.
[
{
"Country":"BW",
"Place_Name":"Gaborone",
"Latitude":-24.6464,
"Longitude":25.9119,
"First_IP_Number":692781056,
"Last_IP_Number":692781567
},
{
"Country":"BW",
"Place_Name":"Gaborone",
"Latitude":-24.6464,
"Longitude":25.9119,
"First_IP_Number":692781824,
"Last_IP_Number":692783103
}
]
index : uses the format {index -> {column -> value}}. See Retrieve_Online_Visitor_10_index.json.
{
"0":
{
"Country":"BW",
"Place_Name":"Gaborone",
"Latitude":-24.6464,
"Longitude":25.9119,
"First_IP_Number":692781056,
"Last_IP_Number":692781567
},
"1":
{
"Country":"BW",
"Place_Name":"Gaborone",
"Latitude":-24.6464,
"Longitude":25.9119,
"First_IP_Number":692781824,
"Last_IP_Number":692783103
}
}
columns : uses the format {column -> {index -> value}}. See Retrieve_Online_Visitor_10_columns.json.
{
"Country":
{"0":"BW","1":"BW"},
"Place_Name":
{"0":"Gaborone","1":"Gaborone"},
"Latitude":
{"0":-24.6464,"1":-24.6464},
"Longitude":
{"0":25.9119,"1":25.9119},
"First_IP_Number":
{"0":692781056,"1":692781824},
"Last_IP_Number":
{"0":692781567,"1":692783103}
}
values : uses the format of a simple values array. See Retrieve_Online_Visitor_10_values.json.
[
["BW","Gaborone",-24.6464,25.9119,692781056,692781567],
["BW","Gaborone",-24.6464,25.9119,692781824,692783103]
]

table : uses the format {'schema': {schema}, 'data': {data}}. See Retrieve_Online_Visitor_10_table.json.
{"schema":
{
"fields":
[
{"name":"index","type":"integer"},
{"name":"Country","type":"string"},
{"name":"Place_Name","type":"string"},
{"name":"Latitude","type":"number"},
{"name":"Longitude","type":"number"},
{"name":"First_IP_Number","type":"integer"},
{"name":"Last_IP_Number","type":"integer"}
],
"primaryKey":["index"],
"pandas_version":"0.20.0"},
"data":
[
{
"index":0,
"Country":"BW",
"Place_Name":"Gaborone",
"Latitude":-24.6464,
"Longitude":25.9119,
"First_IP_Number":692781056,
"Last_IP_Number":692781567
},
{
"index":1,
"Country":"BW",
"Place_Name":"Gaborone",
"Latitude":-24.6464,
"Longitude":25.9119,
"First_IP_Number":692781824,
"Last_IP_Number":692783103
}
]
}
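
Each of these files can be loaded back into pandas by passing the matching orient; a small sketch, assuming the files above were produced by the retrieve script:

# Reload one of the exported JSON files with the matching orient
import pandas as pd

sFileDir = 'C:/VKHCG/02-Krennwallner/01-Retrieve/01-EDS/02-Python'
df = pd.read_json(sFileDir + '/Retrieve_Online_Visitor_10_records.json', orient='records')
print(df.shape)   # (10, 6)
print(df.head())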

XML processing

Start your Python editor and create a file named Retrieve-Online-Visitor-XML.py in directory C:\VKHCG\02-Krennwallner\01-Retrieve.
################################################################
# -*- coding: utf-8 -*-
################################################################
import sys
import os
import pandas as pd
import xml.etree.ElementTree as ET
################################################################
def df2xml(data):
    header = data.columns
    root = ET.Element('root')
    for row in range(data.shape[0]):
        entry = ET.SubElement(root, 'entry')
        for index in range(data.shape[1]):
            schild=str(header[index])
            child = ET.SubElement(entry, schild)
            if str(data[schild][row]) != 'nan':
                child.text = str(data[schild][row])
            else:
                child.text = 'n/a'
            entry.append(child)
    result = ET.tostring(root)
    return result
################################################################
def xml2df(xml_data):
    root = ET.XML(xml_data)
    all_records = []
    for i, child in enumerate(root):
        record = {}
        for subchild in child:
            record[subchild.tag] = subchild.text
        all_records.append(record)
    return pd.DataFrame(all_records)
################################################################
InputFileName='IP_DATA_ALL.csv'
OutputFileName='Retrieve_Online_Visitor.xml'
CompanyIn= '01-Vermeulen'
CompanyOut= '02-Krennwallner'
################################################################
if sys.platform == 'linux':
    Base=os.path.expanduser('~') + '/VKHCG'
else:
    Base='C:/VKHCG'
################################################################
print('################################')
print('Working Base :', Base, ' using ', sys.platform)
print('################################')
################################################################
sFileName=Base + '/' + CompanyIn + '/00-RawData/' + InputFileName
print('Loading :', sFileName)
IP_DATA_ALL=pd.read_csv(sFileName, header=0, low_memory=False)
IP_DATA_ALL.rename(columns={'Place Name': 'Place_Name'}, inplace=True)
IP_DATA_ALL.rename(columns={'First IP Number': 'First_IP_Number'}, inplace=True)
IP_DATA_ALL.rename(columns={'Last IP Number': 'Last_IP_Number'}, inplace=True)
IP_DATA_ALL.rename(columns={'Post Code': 'Post_Code'}, inplace=True)
################################################################
sFileDir=Base + '/' + CompanyOut + '/01-Retrieve/01-EDS/02-Python'
if not os.path.exists(sFileDir):
    os.makedirs(sFileDir)
visitordata = IP_DATA_ALL.head(10000)
print('Original Subset Data Frame')
print('Rows :', visitordata.shape[0])
print('Columns :', visitordata.shape[1])
print(visitordata)
print('Export XML')
sXML=df2xml(visitordata)
sFileName=sFileDir + '/' + OutputFileName
file_out = open(sFileName, 'wb')
file_out.write(sXML)
file_out.close()
print('Store XML:', sFileName)
xml_data = open(sFileName).read()
unxmlrawdata=xml2df(xml_data)
print('Raw XML Data Frame')
print('Rows :', unxmlrawdata.shape[0])
print('Columns :', unxmlrawdata.shape[1])
print(unxmlrawdata)
unxmldata = unxmlrawdata.drop_duplicates(subset=None, keep='first', inplace=False)
print('Deduplicated XML Data Frame')
print('Rows :', unxmldata.shape[0])
print('Columns :', unxmldata.shape[1])
print(unxmldata)
#################################################################
print('### Done!! ############################################')
#################################################################

Output:
See a file named Retrieve_Online_Visitor.xml in C:\VKHCG\02-Krennwallner\01-Retrieve\01-EDS\02-Python. This enables you to deliver XML-format data as part of the retrieve step.
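
As a quick check (a usage sketch, not in the original listing), you can feed a literal XML string through the xml2df() function defined above and confirm the round trip back to a DataFrame:

sSampleXML = '<root><entry><Country>BW</Country><Place_Name>Gaborone</Place_Name></entry></root>'
print(xml2df(sSampleXML))
# Expected: a one-row DataFrame with the columns Country and Place_Name.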
Hillman Ltd

The company has four main jobs requiring your attention:
• Planning the locations of the warehouses: Hillman has countless UK warehouses, but owing to financial hardships, the business wants to shrink the quantity of warehouses by 20%.
• Planning the shipping rules for best-fit international logistics: Until now, the company has shipped goods from its international warehouses to its UK shops at Hillman Global Logistics' expense. This model is no longer sustainable. The co-owned shops now want more flexibility regarding shipping options.
• Adopting the best packing option for shipping in containers: Hillman has introduced a new three-size shipping-container solution. It needs a packing solution encompassing the warehouses, shops, and customers.
• Creating a delivery route: Hillman needs to preplan a delivery route for each of its warehouses to shops, to realize a 30% savings in shipping costs.

A. Planning Shipping Rules for Best-Fit International Logistics

(Before this program, first understand the business terms explained in the reference book.)

EXW—Ex Works (Named Place of Delivery)

By this term, the seller makes the goods available at its premises or at another named place. This term places the maximum obligation on the buyer and minimum obligations on the seller.

Start your Python editor and create a file named Retrieve-Incoterm-EXW.py in directory C:\VKHCG\03-Hillman\01-Retrieve.
################################################################
# -*- coding: utf-8 -*-
import os
import sys
import pandas as pd
IncoTerm='EXW'
InputFileName='Incoterm_2010.csv'
OutputFileName='Retrieve_Incoterm_' + IncoTerm + '_RuleSet.csv'
Company='03-Hillman'
################################################################
Base='C:/VKHCG'
print('################################')
print('Working Base :',Base, ' using ', sys.platform)
print('################################')
sFileDir=Base + '/' + Company + '/01-Retrieve/01-EDS/02-Python'
if not os.path.exists(sFileDir):
    os.makedirs(sFileDir)
################################################################
### Import Incoterms
################################################################
sFileName=Base + '/' + Company + '/00-RawData/' + InputFileName
print('###########')
print('Loading :',sFileName)
IncotermGrid=pd.read_csv(sFileName,header=0,low_memory=False)
# Keep only the rows whose Shipping_Term matches the selected Incoterm.
IncotermRule=IncotermGrid[IncotermGrid.Shipping_Term == IncoTerm]
print('Rows :',IncotermRule.shape[0])
print('Columns :',IncotermRule.shape[1])
print('###########')
print(IncotermRule)
sFileName=sFileDir + '/' + OutputFileName
IncotermRule.to_csv(sFileName, index = False)
print('### Done!! ############################################')

Output:
See the file named Retrieve_Incoterm_EXW_RuleSet.csv in C:\VKHCG\03-Hillman\01-Retrieve\01-EDS\02-Python. Open this file to inspect the extracted EXW rule set.

FCA—Free Carrier (Named Place of Delivery)

Under this term, the seller delivers the goods, cleared for export, at a named place. Suppose I buy a copy of Practical Data Science at an overseas duty-free shop and arrange to pick it up at the duty-free desk before taking it home, shipped FCA—Free Carrier—to the duty-free desk. The moment I pay at the register, ownership transfers to me, but if anything happens to the book between the shop and the duty-free desk, the shop has to pay. Only once I pick it up at the desk does the risk become mine. So, the moment I take the book, the transaction becomes EXW, and I have to pay any necessary import duties on arrival in my home country. Let's see what the data science finds.

Start your Python editor and create a text file named Retrieve-Incoterm-FCA.py in directory C:\VKHCG\03-Hillman\01-Retrieve.
################################################################
# -*- coding: utf-8 -*-
################################################################
import os
import sys
import pandas as pd
################################################################
IncoTerm='FCA'
InputFileName='Incoterm_2010.csv'
OutputFileName='Retrieve_Incoterm_' + IncoTerm + '_RuleSet.csv'
Company='03-Hillman'
################################################################
Base='C:/VKHCG'
print('################################')
print('Working Base :',Base, ' using ', sys.platform)
print('################################')
################################################################
sFileDir=Base + '/' + Company + '/01-Retrieve/01-EDS/02-Python'
if not os.path.exists(sFileDir):
    os.makedirs(sFileDir)
################################################################
### Import Incoterms
################################################################
sFileName=Base + '/' + Company + '/00-RawData/' + InputFileName
print('###########')
print('Loading :',sFileName)
IncotermGrid=pd.read_csv(sFileName,header=0,low_memory=False)
IncotermRule=IncotermGrid[IncotermGrid.Shipping_Term == IncoTerm]
print('Rows :',IncotermRule.shape[0])
print('Columns :',IncotermRule.shape[1])
print('###########')
print(IncotermRule)
sFileName=sFileDir + '/' + OutputFileName
IncotermRule.to_csv(sFileName, index = False)
################################################################
print('### Done!! ############################################')
################################################################
Output:
See the file named Retrieve_Incoterm_FCA_RuleSet.csv in C:\VKHCG\03-Hillman\01-Retrieve\01-EDS\02-Python.

CPT—Carriage Paid To (Named Place of Destination)

The seller, under this term, pays for the carriage of the goods up to the named place of destination. However, the goods are considered to be delivered when they have been handed over to the first carrier, so that the risk transfers to the buyer upon handing the goods over to the carrier at the place of shipment in the country of export.

Start your Python editor and create a file named Retrieve-Incoterm-CPT.py in directory C:\VKHCG\03-Hillman\01-Retrieve. The program follows the same pattern as the EXW and FCA listings, with IncoTerm='CPT'.

CIP—Carriage and Insurance Paid To (Named Place of Destination)

This term is generally similar to the preceding CPT, with the exception that the seller is required to obtain insurance for the goods while in transit. Following is the data science version.

DAP—Delivered at Place (Named Place of Destination)

According to Incoterm 2010's definition, DAP—Delivered at Place—means that the seller delivers when the goods are placed at the disposal of the buyer on the arriving means of transport, ready for unloading, at the named place of destination. Under DAP terms, the risk passes from seller to buyer from the point of destination mentioned in the contract of delivery.

DDP—Delivered Duty Paid (Named Place of Destination)

By this term, the seller is responsible for delivering the goods to the named place in the country of the buyer and pays all costs in bringing the goods to the destination, including import duties and taxes. The seller is not responsible for unloading. This term places the maximum obligations on the seller and minimum obligations on the buyer. No risk or responsibility is transferred to the buyer until delivery of the goods at the named place of destination.
B. Possible Shipping Routes

There are numerous potential shipping routes available to the company. The retrieve step can generate the potential set by using a route-combination generator. This gives you a set of routes, but it is highly unlikely that you will ship along all of them. It is simply a population of routes that the data science can use to find the optimum solution.

Start your Python editor and create a file named Retrieve-Warehouse-Incoterm-Chains.py in directory C:\VKHCG\03-Hillman\01-Retrieve.

################################################################
# -*- coding: utf-8 -*-
################################################################
import sys
import os
import pandas as pd
from math import radians, cos, sin, asin, sqrt
################################################################
def haversine(lon1, lat1, lon2, lat2, stype):
    """
    Calculate the great circle distance between two points
    on the earth (specified in decimal degrees)
    """
    # convert decimal degrees to radians
    lon1, lat1, lon2, lat2 = map(radians, [lon1, lat1, lon2, lat2])

    # haversine formula
    dlon = lon2 - lon1
    dlat = lat2 - lat1
    a = sin(dlat/2)**2 + cos(lat1) * cos(lat2) * sin(dlon/2)**2
    c = 2 * asin(sqrt(a))
    if stype == 'km':
        r = 6371 # Radius of earth in kilometers
    else:
        r = 3956 # Radius of earth in miles
    d = round(c * r, 3)
    return d
################################################################
InputFileName='GB_Postcode_Warehouse.csv'
OutputFileName='Retrieve_Incoterm_Chain_GB_Warehouse.csv'
Company='03-Hillman'
################################################################
if sys.platform == 'linux':
    Base=os.path.expanduser('~') + '/VKHCG'
else:
    Base='C:/VKHCG'
print('################################')
print('Working Base :',Base, ' using ', sys.platform)
print('################################')
################################################################
sFileDir=Base + '/' + Company + '/01-Retrieve/01-EDS/02-Python'
if not os.path.exists(sFileDir):
    os.makedirs(sFileDir)
################################################################
### Import Warehouse
################################################################
sFileName=Base + '/' + Company + '/00-RawData/' + InputFileName
print('###########')
print('Loading :',sFileName)
Warehouse=pd.read_csv(sFileName,header=0,low_memory=False)
WarehouseGood=Warehouse[Warehouse.latitude != 0]
WarehouseGood['Warehouse_Name']=WarehouseGood.apply(lambda row:'WH-' + row['postcode'],axis=1)
WarehouseGood.drop('id', axis=1, inplace=True)
WarehouseGood.drop('postcode', axis=1, inplace=True)
################################################################
# Take the first 100 warehouses as the 'from' side, tagging each
# column with a _from suffix.
WarehouseFrom=WarehouseGood.head(100)
for i in range(WarehouseFrom.shape[1]):
    oldColumn=WarehouseFrom.columns[i]
    newColumn=oldColumn + '_from'
    WarehouseFrom.rename(columns={oldColumn: newColumn}, inplace=True)
WarehouseFrom.insert(3,'Keys', 1)
################################################################
# The same 100 warehouses as the 'to' side, tagged with a _to suffix.
WarehouseTo=WarehouseGood.head(100)
for i in range(WarehouseTo.shape[1]):
    oldColumn=WarehouseTo.columns[i]
    newColumn=oldColumn + '_to'
    WarehouseTo.rename(columns={oldColumn: newColumn}, inplace=True)
WarehouseTo.insert(3,'Keys', 1)
################################################################
# Merging on the constant Keys column produces the full from/to cross join.
WarehouseCross=pd.merge(right=WarehouseFrom,left=WarehouseTo,how='outer',on='Keys')
WarehouseCross.drop('Keys', axis=1, inplace=True)
WarehouseCross.insert(0,'Incoterm', 'DDP')
WarehouseCross['DistanceBetweenKilometers'] = WarehouseCross.apply(lambda row: haversine(
    row['longitude_from'],row['latitude_from'],row['longitude_to'],row['latitude_to'],'km'),axis=1)
WarehouseCross['DistanceBetweenMiles'] = WarehouseCross.apply(lambda row: haversine(
    row['longitude_from'],row['latitude_from'],row['longitude_to'],row['latitude_to'],'miles'),axis=1)
WarehouseCross.drop('longitude_from', axis=1, inplace=True)
WarehouseCross.drop('latitude_from', axis=1, inplace=True)
WarehouseCross.drop('longitude_to', axis=1, inplace=True)
WarehouseCross.drop('latitude_to', axis=1, inplace=True)
WarehouseCrossClean=WarehouseCross[WarehouseCross.DistanceBetweenKilometers != 0]
print('###########')
print('Rows :',WarehouseCrossClean.shape[0])
print('Columns :',WarehouseCrossClean.shape[1])
print('###########')
################################################################
sFileName=sFileDir + '/' + OutputFileName
WarehouseCrossClean.to_csv(sFileName, index = False)
#################################################################
print('### Done!! ############################################')
#################################################################

You will see a file named Retrieve_Incoterm_Chain_GB_Warehouse.csv in C:\VKHCG\03-Hillman\01-Retrieve\01-EDS\02-Python. Open this file; it holds one row per from/to warehouse pair, with the distance between them in kilometers and miles.
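
The route generator relies on a classic pandas cross-join trick: both frames get a constant Keys column, and an outer merge on that column yields every from/to combination. A minimal sketch with two hypothetical warehouses:

import pandas as pd

dfFrom = pd.DataFrame({'Warehouse_from': ['WH-AB10', 'WH-AB11'], 'Keys': 1})
dfTo = pd.DataFrame({'Warehouse_to': ['WH-AB10', 'WH-AB11'], 'Keys': 1})

# Outer merge on the constant key, then drop it: 2 x 2 = 4 route rows.
dfCross = pd.merge(dfFrom, dfTo, how='outer', on='Keys').drop('Keys', axis=1)
print(dfCross)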
C. Adopt New Shipping Containers

Adopting the best packing option for shipping in containers requires a new concept. Container shipping is based on reducing the packaging you use down to an optimum set of sizes, with the following requirements:
• The product must fit within the box formed by the four sides of a cube.
• The product can be secured using packing foam, which will fill any void volume in the packaging.
• Packaging must fit in shipping containers with zero space gaps.
• Containers can only hold product that is shipped to a single warehouse, shop, or customer.

Start your Python editor and create a text file named Retrieve-Container-Plan.py in directory C:\VKHCG\03-Hillman\01-Retrieve.
################################################################
# -*- coding: utf-8 -*-
################################################################
import sys
import os
import pandas as pd
################################################################
ContainerFileName='Retrieve_Container.csv'
BoxFileName='Retrieve_Box.csv'
ProductFileName='Retrieve_Product.csv'
Company='03-Hillman'
################################################################
Base='C:/VKHCG'
################################################################
print('################################')
print('Working Base :',Base, ' using ', sys.platform)
print('################################')
################################################################
sFileDir=Base + '/' + Company + '/01-Retrieve/01-EDS/02-Python'
if not os.path.exists(sFileDir):
    os.makedirs(sFileDir)
################################################################
### Create the Containers
# (The business requires a model that simulates shipping containers of
# dimensions 1 meter x 1 meter x 1 meter to 21 meters x 10 meters x
# 6 meters. So, let's simulate these.)
################################################################
containerLength=range(1,21)
containerWidth=range(1,10)
containerHeight=range(1,6)
containerStep=1
c=0
for l in containerLength:
    for w in containerWidth:
        for h in containerHeight:
            containerVolume=(l/containerStep)*(w/containerStep)*(h/containerStep)
            c=c+1
            ContainerLine=[('ShipType', ['Container']),
                           ('UnitNumber', ('C'+format(c,"06d"))),
                           ('Length',(format(round(l,3),".4f"))),
                           ('Width',(format(round(w,3),".4f"))),
                           ('Height',(format(round(h,3),".4f"))),
                           ('ContainerVolume',(format(round(containerVolume,6),".6f")))]
            # dict(...) turns the (column, value) pairs into a one-row frame.
            if c==1:
                ContainerFrame = pd.DataFrame(dict(ContainerLine))
            else:
                ContainerRow = pd.DataFrame(dict(ContainerLine))
                ContainerFrame = ContainerFrame.append(ContainerRow)
ContainerFrame.index.name = 'IDNumber'
print('################')
print('## Container')
print('################')
print('Rows :',ContainerFrame.shape[0])
print('Columns :',ContainerFrame.shape[1])
print('################')
################################################################
sFileContainerName=sFileDir + '/' + ContainerFileName
ContainerFrame.to_csv(sFileContainerName, index = False)
################################################################
## Create valid Boxes with packing foam
# (Your second simulation is the cardboard boxes for the packing of the
# products. The requirement is for boxes having a dimension of 100
# centimeters x 100 centimeters x 100 centimeters to 2.1 meters x
# 2.1 meters x 2.1 meters. You can also use between zero and 600
# centimeters of packing foam to secure any product in the box.)
################################################################
boxLength=range(1,21)
boxWidth=range(1,21)
boxHeight=range(1,21)
packThick=range(0,6)
boxStep=10
b=0
for l in boxLength:
    for w in boxWidth:
        for h in boxHeight:
            for t in packThick:
                boxVolume=round((l/boxStep)*(w/boxStep)*(h/boxStep),6)
                productVolume=round(((l-t)/boxStep)*((w-t)/boxStep)*((h-t)/boxStep),6)
                if productVolume > 0:
                    b=b+1
                    BoxLine=[('ShipType', ['Box']),
                             ('UnitNumber', ('B'+format(b,"06d"))),
                             ('Length',(format(round(l/10,6),".6f"))),
                             ('Width',(format(round(w/10,6),".6f"))),
                             ('Height',(format(round(h/10,6),".6f"))),
                             ('Thickness',(format(round(t/5,6),".6f"))),
                             ('BoxVolume',(format(round(boxVolume,9),".9f"))),
                             ('ProductVolume',(format(round(productVolume,9),".9f")))]
                    if b==1:
                        BoxFrame = pd.DataFrame(dict(BoxLine))
                    else:
                        BoxRow = pd.DataFrame(dict(BoxLine))
                        BoxFrame = BoxFrame.append(BoxRow)
BoxFrame.index.name = 'IDNumber'
print('#################')
print('## Box')
print('#################')
print('Rows :',BoxFrame.shape[0])
print('Columns :',BoxFrame.shape[1])
print('#################')
################################################################
sFileBoxName=sFileDir + '/' + BoxFileName
BoxFrame.to_csv(sFileBoxName, index = False)
################################################################
## Create valid Product
# (Now, your third model is for the products. The requirement is for
# products having a dimension of 100 centimeters x 100 centimeters x
# 100 centimeters to 2.1 meters x 2.1 meters x 2.1 meters.)
################################################################
productLength=range(1,21)
productWidth=range(1,21)
productHeight=range(1,21)
productStep=10
p=0
for l in productLength:
    for w in productWidth:
        for h in productHeight:
            productVolume=round((l/productStep)*(w/productStep)*(h/productStep),6)
            if productVolume > 0:
                p=p+1
                ProductLine=[('ShipType', ['Product']),
                             ('UnitNumber', ('P'+format(p,"06d"))),
                             ('Length',(format(round(l/10,6),".6f"))),
                             ('Width',(format(round(w/10,6),".6f"))),
                             ('Height',(format(round(h/10,6),".6f"))),
                             ('ProductVolume',(format(round(productVolume,9),".9f")))]
                if p==1:
                    ProductFrame = pd.DataFrame(dict(ProductLine))
                else:
                    ProductRow = pd.DataFrame(dict(ProductLine))
                    ProductFrame = ProductFrame.append(ProductRow)
ProductFrame.index.name = 'IDNumber'
print('#################')
print('## Product')
print('#################')
print('Rows :',ProductFrame.shape[0])
print('Columns :',ProductFrame.shape[1])
print('#################')
################################################################
sFileProductName=sFileDir + '/' + ProductFileName
ProductFrame.to_csv(sFileProductName, index = False)
################################################################
#################################################################
print('### Done!! ############################################')
#################################################################
Output:

See the container data files Retrieve_Container.csv, Retrieve_Box.csv, and Retrieve_Product.csv in C:\VKHCG\03-Hillman\01-Retrieve\01-EDS\02-Python.
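
A side note on the design: the per-row DataFrame.append() calls above were fine in the pandas of this era but were removed in pandas 2.x. An equivalent sketch (assuming the same 1-20 x 1-9 x 1-5 container grid) collects plain dicts first and builds the frame once, which is also much faster:

import itertools
import pandas as pd

rows = []
# itertools.product replaces the three nested loops.
for c, (l, w, h) in enumerate(itertools.product(range(1, 21), range(1, 10), range(1, 6)), start=1):
    rows.append({'ShipType': 'Container',
                 'UnitNumber': 'C' + format(c, '06d'),
                 'Length': l, 'Width': w, 'Height': h,
                 'ContainerVolume': l * w * h})
ContainerFrame = pd.DataFrame(rows)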
D. Create a Delivery Route

The model enables you to generate a complex routing plan for the shipping routes of the company.

Start your Python editor and create a text file named Retrieve-Route-Plan.py in directory C:\VKHCG\03-Hillman\01-Retrieve.
################################################################
# -*- coding: utf-8 -*-
import os
import sys
import pandas as pd
from geopy.distance import vincenty
################################################################
InputFileName='GB_Postcode_Warehouse.csv'
OutputFileName='Retrieve_GB_Warehouse.csv'
Company='03-Hillman'
################################################################
Base='C:/VKHCG'
print('################################')
print('Working Base :',Base, ' using ', sys.platform)
print('################################')
################################################################
sFileDir=Base + '/' + Company + '/01-Retrieve/01-EDS/02-Python'
if not os.path.exists(sFileDir):
    os.makedirs(sFileDir)
################################################################
sFileName=Base + '/' + Company + '/00-RawData/' + InputFileName
print('###########')
print('Loading :',sFileName)
Warehouse=pd.read_csv(sFileName,header=0,low_memory=False)
# Keep only warehouses with usable coordinates, one row per postcode.
WarehouseClean=Warehouse[Warehouse.latitude != 0]
WarehouseGood=WarehouseClean[WarehouseClean.longitude != 0]
WarehouseGood.drop_duplicates(subset='postcode', keep='first', inplace=True)
WarehouseGood=WarehouseGood.sort_values(by='postcode', ascending=1)
################################################################
sFileName=sFileDir + '/' + OutputFileName
WarehouseGood.to_csv(sFileName, index = False)
################################################################
# One route file per seller warehouse: each of the first 20 warehouses
# is paired with up to 10,000 buyer warehouses.
WarehouseLoop = WarehouseGood.head(20)
for i in range(0,WarehouseLoop.shape[0]):
    print('Run :',i,' =======>>>>>>>>>>',WarehouseLoop['postcode'][i])
    WarehouseHold = WarehouseGood.head(10000)
    WarehouseHold['Transaction']=WarehouseHold.apply(lambda row:'WH-to-WH',axis=1)
    OutputLoopName='Retrieve_Route_' + 'WH-' + WarehouseLoop['postcode'][i] + '_Route.csv'
    WarehouseHold['Seller']=WarehouseHold.apply(lambda row:'WH-' + WarehouseLoop['postcode'][i],axis=1)
    WarehouseHold['Seller_Latitude']=WarehouseHold.apply(lambda row:WarehouseLoop['latitude'][i],axis=1)
    WarehouseHold['Seller_Longitude']=WarehouseHold.apply(lambda row:WarehouseLoop['longitude'][i],axis=1)
    WarehouseHold['Buyer']=WarehouseHold.apply(lambda row:'WH-' + row['postcode'],axis=1)
    WarehouseHold['Buyer_Latitude']=WarehouseHold.apply(lambda row:row['latitude'],axis=1)
    WarehouseHold['Buyer_Longitude']=WarehouseHold.apply(lambda row:row['longitude'],axis=1)
    WarehouseHold['Distance']=WarehouseHold.apply(lambda row: round(
        vincenty((WarehouseLoop['latitude'][i],WarehouseLoop['longitude'][i]),
                 (row['latitude'],row['longitude'])).miles,6),axis=1)
    WarehouseHold.drop('id', axis=1, inplace=True)
    WarehouseHold.drop('postcode', axis=1, inplace=True)
    WarehouseHold.drop('latitude', axis=1, inplace=True)
    WarehouseHold.drop('longitude', axis=1, inplace=True)
    ################################################################
    sFileLoopName=sFileDir + '/' + OutputLoopName
    WarehouseHold.to_csv(sFileLoopName, index = False)
#################################################################
print('### Done!! ############################################')

Output:

====== RESTART: C:\VKHCG\03-Hillman\01-Retrieve\Retrieve-Route-Plan.py ======
################################
Working Base : C:/VKHCG using win32
################################
###########
Loading : C:/VKHCG/03-Hillman/00-RawData/GB_Postcode_Warehouse.csv
Run : 0 =======>>>>>>>>>> AB10
Run : 1 =======>>>>>>>>>> AB11
Run : 2 =======>>>>>>>>>> AB12
Run : 3 =======>>>>>>>>>> AB13
Run : 4 =======>>>>>>>>>> AB14
Run : 5 =======>>>>>>>>>> AB15
Run : 6 =======>>>>>>>>>> AB16
Run : 7 =======>>>>>>>>>> AB21
Run : 8 =======>>>>>>>>>> AB22
Run : 9 =======>>>>>>>>>> AB23
Run : 10 =======>>>>>>>>>> AB24
Run : 11 =======>>>>>>>>>> AB25
Run : 12 =======>>>>>>>>>> AB30
Run : 13 =======>>>>>>>>>> AB31
Run : 14 =======>>>>>>>>>> AB32
Run : 15 =======>>>>>>>>>> AB33
Run : 16 =======>>>>>>>>>> AB34
Run : 17 =======>>>>>>>>>> AB35
Run : 18 =======>>>>>>>>>> AB36
Run : 19 =======>>>>>>>>>> AB37
### Done!! ############################################
>>>
See the collection of files similar in format to Retrieve_Route_WH-AB11_Route.csv in C:\VKHCG\03-Hillman\01-Retrieve\01-EDS\02-Python.
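
Note that vincenty() was removed from geopy in version 2.0. If the import at the top of the script fails, geopy.distance.geodesic is the drop-in replacement (a sketch, assuming geopy 2.x; the coordinates are illustrative):

from geopy.distance import geodesic

# Same call shape as vincenty((lat1, lon1), (lat2, lon2)).miles
d = geodesic((57.1497, -2.0943), (51.5074, -0.1278)).miles
print(round(d, 6))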
Clark Ltd

Clark is the financial powerhouse of the group. It must process all the money-related data sources.

Forex: The first financial duty of the company is to perform any foreign exchange trading.
Forex Base Data: Previously, you found a single data source (Euro_ExchangeRates.csv) for forex rates in Clark. The relevant file is Retrieve_Euro_ExchangeRates.csv in directory C:\VKHCG\04-Clark\01-Retrieve\01-EDS\01-R. So, that data is ready.

Financials: Clark generates the financial statements for all the group's companies.
Financial Base Data: You found a single data source (Profit_And_Loss.csv) in Clark for financials and, as mentioned previously, a single data source (Euro_ExchangeRates.csv) for forex rates. The relevant file is Retrieve_Profit_And_Loss.csv in directory C:\VKHCG\04-Clark\01-Retrieve\01-EDS\01-R.

Person Base Data
Start your Python editor and create a file named Retrieve-PersonData.py in directory C:\VKHCG\04-Clark\01-Retrieve.
################################################################
# -*- coding: utf-8 -*-
################################################################
import sys
import os
import shutil
import zipfile
import pandas as pd
################################################################
Base='C:/VKHCG'
################################################################
print('################################')
print('Working Base :',Base, ' using ', sys.platform)
print('################################')
################################################################
Company='04-Clark'
ZIPFiles=['Data_female-names','Data_male-names','Data_last-names']
for ZIPFile in ZIPFiles:
    InputZIPFile=Base+'/'+Company+'/00-RawData/' + ZIPFile + '.zip'
    OutputDir=Base+'/'+Company+'/01-Retrieve/01-EDS/02-Python/' + ZIPFile
    OutputFile=Base+'/'+Company+'/01-Retrieve/01-EDS/02-Python/Retrieve-'+ZIPFile+'.csv'
    zip_file = zipfile.ZipFile(InputZIPFile, 'r')
    zip_file.extractall(OutputDir)
    zip_file.close()
    t=0
    for dirname, dirnames, filenames in os.walk(OutputDir):
        for filename in filenames:
            sCSVFile = dirname + '/' + filename
            t=t+1
            if t==1:
                NameRawData=pd.read_csv(sCSVFile,header=None,low_memory=False)
                NameData=NameRawData
            else:
                NameRawData=pd.read_csv(sCSVFile,header=None,low_memory=False)
                NameData=NameData.append(NameRawData)
    NameData.rename(columns={0 : 'NameValues'},inplace=True)
    NameData.to_csv(OutputFile, index = False)
    shutil.rmtree(OutputDir)
    print('Process: ',InputZIPFile)
#################################################################
print('### Done!! ############################################')
#################################################################
This generates three files:
Retrieve-Data_female-names.csv
Retrieve-Data_male-names.csv
Retrieve-Data_last-names.csv
Connecting to other Data Sources

A. Program to connect to different data sources.

SQLite:
################################################################
# -*- coding: utf-8 -*-
################################################################
import sqlite3 as sq
import pandas as pd
################################################################
Base='C:/VKHCG'
sDatabaseName=Base + '/01-Vermeulen/00-RawData/SQLite/vermeulen.db'
conn = sq.connect(sDatabaseName)
################################################################
sFileName='C:/VKHCG/01-Vermeulen/01-Retrieve/01-EDS/02-Python/Retrieve_IP_DATA.csv'
print('Loading :',sFileName)
IP_DATA_ALL_FIX=pd.read_csv(sFileName,header=0,low_memory=False)
IP_DATA_ALL_FIX.index.names = ['RowIDCSV']
sTable='IP_DATA_ALL'
print('Storing :',sDatabaseName,' Table:',sTable)
IP_DATA_ALL_FIX.to_sql(sTable, conn, if_exists="replace")
print('Loading :',sDatabaseName,' Table:',sTable)
TestData=pd.read_sql_query("select * from IP_DATA_ALL;", conn)
print('################')
print('## Data Values')
print('################')
print(TestData)
print('################')
print('## Data Profile')
print('################')
print('Rows :',TestData.shape[0])
print('Columns :',TestData.shape[1])
print('################')
print('### Done!! ############################################')
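
A small follow-up sketch (not in the original listing; it assumes the IP_DATA_ALL table written above has a Country column) showing a filtered read with a bound parameter instead of string concatenation:

SubsetData=pd.read_sql_query("select * from IP_DATA_ALL where Country = ?;", conn, params=('GB',))
print('Rows :',SubsetData.shape[0])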

MySQL:

Open MySQL.
Create a database "DataScience".
Create a Python file and add the following code:
################ Connection With MySQL ######################
import mysql.connector

conn = mysql.connector.connect(host='localhost',
                               database='DataScience',
                               user='root',
                               password='root')
if conn.is_connected():
    print('###### Connection With MySQL Established Successfully #####')
else:
    print('Not Connected -- Check Connection Properties')
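
Once connected, queries run through a cursor. A minimal sketch (the query only asks the server which database is selected, so it assumes nothing about your tables):

cursor = conn.cursor()
cursor.execute('SELECT DATABASE()')
print('Connected to:', cursor.fetchone())
cursor.close()
conn.close()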
Microsoft Excel

################## Retrieve-Country-Currency.py #################
################################################################
# -*- coding: utf-8 -*-
################################################################
import os
import pandas as pd
################################################################
Base='C:/VKHCG'
################################################################
sFileDir=Base + '/01-Vermeulen/01-Retrieve/01-EDS/02-Python'
if not os.path.exists(sFileDir):
    os.makedirs(sFileDir)
################################################################
CurrencyRawData = pd.read_excel('C:/VKHCG/01-Vermeulen/00-RawData/Country_Currency.xlsx')
sColumns = ['Country or territory', 'Currency', 'ISO-4217']
CurrencyData = CurrencyRawData[sColumns]
CurrencyData.rename(columns={'Country or territory': 'Country', 'ISO-4217': 'CurrencyCode'}, inplace=True)
CurrencyData.dropna(subset=['Currency'],inplace=True)
CurrencyData['Country'] = CurrencyData['Country'].map(lambda x: x.strip())
CurrencyData['Currency'] = CurrencyData['Currency'].map(lambda x: x.strip())
CurrencyData['CurrencyCode'] = CurrencyData['CurrencyCode'].map(lambda x: x.strip())
print(CurrencyData)
print('~~~~~~ Data from Excel Sheet Retrieved Successfully ~~~~~~~')
################################################################
sFileName=sFileDir + '/Retrieve-Country-Currency.csv'
CurrencyData.to_csv(sFileName, index = False)
OUTPUT:


Practical 5

Aim:Assessing Data

A. Perform error management on the given data using the pandas package.

The Python pandas package enables several automatic error-management features.
File Location: C:\VKHCG\01-Vermeulen\02-Assess
Missing Values in Pandas:

i. Drop the Columns Where All Elements Are Missing Values

Code:

#### Assess-Good-Bad-01.py ####
# -*- coding: utf-8 -*-
################################################################
import sys
import os
import pandas as pd
################################################################
Base='C:/VKHCG'
################################################################
print('################################')
print('Working Base :',Base, ' using ', sys.platform)
print('################################')
################################################################
sInputFileName='Good-or-Bad.csv'
sOutputFileName='Good-or-Bad-01.csv'
Company='01-Vermeulen'
################################################################
sFileDir=Base + '/' + Company + '/02-Assess/01-EDS/02-Python'
if not os.path.exists(sFileDir):
    os.makedirs(sFileDir)
################################################################
### Import Good-or-Bad Data
################################################################
sFileName=Base + '/' + Company + '/00-RawData/' + sInputFileName
print('Loading :',sFileName)
RawData=pd.read_csv(sFileName,header=0)
print('################################')
print('## Raw Data Values')
print('################################')
print(RawData)
print('################################')
print('## Data Profile')
print('################################')
print('Rows :',RawData.shape[0])
print('Columns :',RawData.shape[1])
print('################################')
################################################################
sFileName=sFileDir + '/' + sInputFileName
RawData.to_csv(sFileName, index = False)
################################################################
# Drop every column whose values are all missing.
TestData=RawData.dropna(axis=1, how='all')
################################################################
print('################################')
print('## Test Data Values')
print('################################')
print(TestData)
print('################################')
print('## Data Profile')
print('################################')
print('Rows :',TestData.shape[0])
print('Columns :',TestData.shape[1])
print('################################')
################################################################
sFileName=sFileDir + '/' + sOutputFileName
TestData.to_csv(sFileName, index = False)
################################################################
print('################################')
print('### Done!! #####################')
print('################################')
Output:

>>>
======= RESTART: C:\VKHCG\01-Vermeulen\02-Assess\Assess-Good-Bad-01.py =======
################################
Working Base : C:/VKHCG using win32
################################
Loading : C:/VKHCG/01-Vermeulen/00-RawData/Good-or-Bad.csv
################################
## Raw Data Values
################################
ID FieldA FieldB FieldC FieldD FieldE FieldF FieldG
0 1.0 Good Better Best 1024.0 NaN 10241.0 1
1 2.0 Good NaN Best 512.0 NaN 5121.0 2
2 3.0 Good Better NaN 256.0 NaN 256.0 3
3 4.0 Good Better Best NaN NaN 211.0 4
4 5.0 Good Better NaN 64.0 NaN 6411.0 5
5 6.0 Good NaN Best 32.0 NaN 32.0 6
6 7.0 NaN Better Best 16.0 NaN 1611.0 7
7 8.0 NaN NaN Best 8.0 NaN 8111.0 8
8 9.0 NaN NaN NaN 4.0 NaN 41.0 9
9 10.0 A B C 2.0 NaN 21111.0 10
10 NaN NaN NaN NaN NaN NaN NaN 11
11 10.0 Good Better Best 1024.0 NaN 102411.0 12
12 10.0 Good NaN Best 512.0 NaN 512.0 13
13 10.0 Good Better NaN 256.0 NaN 1256.0 14
14 10.0 Good Better Best NaN NaN NaN 15
15 10.0 Good Better NaN 64.0 NaN 164.0 16
16 10.0 Good NaN Best 32.0 NaN 322.0 17
17 10.0 NaN Better Best 16.0 NaN 163.0 18
18 10.0 NaN NaN Best 8.0 NaN 844.0 19
19 10.0 NaN NaN NaN 4.0 NaN 4555.0 20
20 10.0 A B C 2.0 NaN 111.0 21
################################
## Data Profile
################################
Rows : 21
Columns : 8
################################
################################
## Test Data Values
################################
ID FieldA FieldB FieldC FieldD FieldF FieldG
0 1.0 Good Better Best 1024.0 10241.0 1
1 2.0 Good NaN Best 512.0 5121.0 2
2 3.0 Good Better NaN 256.0 256.0 3
3 4.0 Good Better Best NaN 211.0 4
4 5.0 Good Better NaN 64.0 6411.0 5
5 6.0 Good NaN Best 32.0 32.0 6
6 7.0 NaN Better Best 16.0 1611.0 7
7 8.0 NaN NaN Best 8.0 8111.0 8
8 9.0 NaN NaN NaN 4.0 41.0 9
9 10.0 A B C 2.0 21111.0 10
10 NaN NaN NaN NaN NaN 11
11 10.0 Good Better Best 1024.0 102411.0 12
12 10.0 Good NaN Best 512.0 512.0 13
13 10.0 Good Better NaN 256.0 1256.0 14
14 10.0 Good Better Best NaN NaN 15
15 10.0 Good Better NaN 64.0 164.0 16
16 10.0 Good NaN Best 32.0 322.0 17
17 10.0 NaN Better Best 16.0 163.0 18
18 10.0 NaN NaN Best 8.0 844.0 19
19 10.0 NaN NaN NaN 4.0 4555.0 20
20 10.0 A B C 2.0 111.0 21
################################
## Data Profile
################################
Rows : 21
Columns : 7
################################
################################
### Done!! #####################
################################
>>>
All of column FieldE has been deleted, because every value in that column was a missing value/error.

ii. Drop the Columns Where Any of the Elements Is Missing Values

###### Assess-Good-Bad-02.py ######
# -*- coding: utf-8 -*-
import sys
import os
import pandas as pd
################################################################
Base='C:/VKHCG'
sInputFileName='Good-or-Bad.csv'
sOutputFileName='Good-or-Bad-02.csv'
Company='01-Vermeulen'
################################################################
print('################################')
print('Working Base :',Base, ' using ', sys.platform)
print('################################')
################################################################
sFileDir=Base + '/' + Company + '/02-Assess/01-EDS/02-Python'
if not os.path.exists(sFileDir):
    os.makedirs(sFileDir)
################################################################
### Import Good-or-Bad Data
################################################################
sFileName=Base + '/' + Company + '/00-RawData/' + sInputFileName
print('Loading :',sFileName)
RawData=pd.read_csv(sFileName,header=0)
print('################################')
print('## Raw Data Values')
print('################################')
print(RawData)
print('################################')
print('## Data Profile')
print('################################')
print('Rows :',RawData.shape[0])
print('Columns :',RawData.shape[1])
print('################################')
################################################################
sFileName=sFileDir + '/' + sInputFileName
RawData.to_csv(sFileName, index = False)
################################################################
# Drop every column that contains even a single missing value.
TestData=RawData.dropna(axis=1, how='any')
################################################################
print('################################')
print('## Test Data Values')
print('################################')
print(TestData)
print('################################')
print('## Data Profile')
print('################################')
print('Rows :',TestData.shape[0])
print('Columns :',TestData.shape[1])
print('################################')
################################################################
sFileName=sFileDir + '/' + sOutputFileName
TestData.to_csv(sFileName, index = False)
################################################################
print('################################')
print('### Done!! #####################')
print('################################')
################################################################
Output:
======= RESTART: C:\VKHCG\01-Vermeulen\02-Assess\Assess-Good-Bad-02.py =======
################################
Working Base : C:/VKHCG using win32
################################
Loading : C:/VKHCG/01-Vermeulen/00-RawData/Good-or-Bad.csv
################################
## Raw Data Values
################################
ID FieldA FieldB FieldC FieldD FieldE FieldF FieldG
0 1.0 Good Better Best 1024.0 NaN 10241.0 1
1 2.0 Good NaN Best 512.0 NaN 5121.0 2
2 3.0 Good Better NaN 256.0 NaN 256.0 3
3 4.0 Good Better Best NaN NaN 211.0 4
4 5.0 Good Better NaN 64.0 NaN 6411.0 5
5 6.0 Good NaN Best 32.0 NaN 32.0 6
6 7.0 NaN Better Best 16.0 NaN 1611.0 7
7 8.0 NaN NaN Best 8.0 NaN 8111.0 8
8 9.0 NaN NaN NaN 4.0 NaN 41.0 9
9 10.0 A B C 2.0 NaN 21111.0 10
10 NaN NaN NaN NaN NaN NaN NaN 11
11 10.0 Good Better Best 1024.0 NaN 102411.0 12
12 10.0 Good NaN Best 512.0 NaN 512.0 13
13 10.0 Good Better NaN 256.0 NaN 1256.0 14
14 10.0 Good Better Best NaN NaN NaN 15
15 10.0 Good Better NaN 64.0 NaN 164.0 16
16 10.0 Good NaN Best 32.0 NaN 322.0 17
17 10.0 NaN Better Best 16.0 NaN 163.0 18
18 10.0 NaN NaN Best 8.0 NaN 844.0 19
19 10.0 NaN NaN NaN 4.0 NaN 4555.0 20
20 10.0 A B C 2.0 NaN 111.0 21
################################
## Data Profile
################################
Rows : 21
Columns : 8
################################
################################
## Test Data Values
################################
FieldG
0 1
1 2
2 3
3 4
4 5
5 6
6 7
7 8
8 9
9 10
10 11
11 12
12 13
13 14
14 15
15 16
16 17
17 18
18 19
19 20
20 21
################################
## Data Profile
################################
Rows : 21
Columns : 1
################################
################################
### Done!! #####################
################################
>>>

iii. Keep Only the Rows That Contain a Maximum of Two Missing Values

############# Assess-Good-Bad-03.py ################
# -*- coding: utf-8 -*-
################################################################
import sys
import os
import pandas as pd
################################################################
sInputFileName='Good-or-Bad.csv'
sOutputFileName='Good-or-Bad-03.csv'
Company='01-Vermeulen'
Base='C:/VKHCG'
################################################################
print('################################')
print('Working Base :',Base, ' using Windows ~~~~')
print('################################')
################################################################
sFileDir=Base + '/' + Company + '/02-Assess/01-EDS/02-Python'
if not os.path.exists(sFileDir):
    os.makedirs(sFileDir)
################################################################
### Import Good-or-Bad Data
################################################################
sFileName=Base + '/' + Company + '/00-RawData/' + sInputFileName
print('Loading :',sFileName)
RawData=pd.read_csv(sFileName,header=0)
print('################################')
print('## Raw Data Values')
print('################################')
print(RawData)
print('################################')
print('## Data Profile')
print('################################')
print('Rows :',RawData.shape[0])
print('Columns :',RawData.shape[1])
print('################################')
################################################################
sFileName=sFileDir + '/' + sInputFileName
RawData.to_csv(sFileName, index = False)
################################################################
# thresh=2 keeps every row that has at least two non-missing values.
TestData=RawData.dropna(thresh=2)
print('################################')
print('## Test Data Values')
print('################################')
print(TestData)
print('################################')
print('## Data Profile')
print('################################')
print('Rows :',TestData.shape[0])
print('Columns :',TestData.shape[1])
print('################################')
sFileName=sFileDir + '/' + sOutputFileName
TestData.to_csv(sFileName, index = False)
################################################################
print('################################')
print('### Done!! #####################')
print('################################')

Output:

Before: the full Good-or-Bad.csv data set, as shown in the raw-data outputs above.

After: dropna(thresh=2) keeps only the rows that have at least two non-missing values, so the row that was almost entirely missing values (index 10) was deleted.
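
The three variants in this practical differ only in their dropna() arguments. A compact sketch on a toy frame (values are illustrative) summarizing all three:

import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [1, np.nan, np.nan],
                   'B': [np.nan, np.nan, np.nan],
                   'C': [3, 4, 5]})

print(df.dropna(axis=1, how='all'))  # drops column B (every value missing)
print(df.dropna(axis=1, how='any'))  # keeps only column C (no missing values)
print(df.dropna(thresh=2))           # keeps only row 0 (two non-missing values)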

The next step along the route is to generate a full network-routing solution for the company, resolving the data issues found in the retrieved data.
B. Write a Python / R program to create the network routing diagram from the given data on routers.

####### Assess-Network-Routing-Company.py ########
import sys
import os
import pandas as pd
################################################################
pd.options.mode.chained_assignment = None
################################################################
Base='C:/VKHCG'
################################################################
print('################################')
print('Working Base :',Base, ' using Windows')
print('################################')
################################################################
sInputFileName1='01-Retrieve/01-EDS/01-R/Retrieve_Country_Code.csv'
sInputFileName2='01-Retrieve/01-EDS/02-Python/Retrieve_Router_Location.csv'
sInputFileName3='01-Retrieve/01-EDS/01-R/Retrieve_IP_DATA.csv'
################################################################
sOutputFileName='Assess-Network-Routing-Company.csv'
Company='01-Vermeulen'
################################################################
################################################################
### Import Country Data
################################################################
sFileName=Base + '/' + Company + '/' + sInputFileName1
print('################################')
print('Loading :',sFileName)
print('################################')
CountryData=pd.read_csv(sFileName,header=0,low_memory=False, encoding="latin-1")
print('Loaded Country:',CountryData.columns.values)
print('################################')
################################################################
## Assess Country Data
################################################################
print('################################')
print('Changed :',CountryData.columns.values)
CountryData.rename(columns={'Country': 'Country_Name'}, inplace=True)
CountryData.rename(columns={'ISO-2-CODE': 'Country_Code'}, inplace=True)
CountryData.drop('ISO-M49', axis=1, inplace=True)
CountryData.drop('ISO-3-Code', axis=1, inplace=True)
CountryData.drop('RowID', axis=1, inplace=True)
print('To :',CountryData.columns.values)
print('################################')
################################################################
################################################################
### Import Company Data
################################################################
sFileName=Base + '/' + Company + '/' + sInputFileName2
print('################################')
print('Loading :',sFileName)
print('################################')
CompanyData=pd.read_csv(sFileName,header=0,low_memory=False, encoding="latin-1")
print('Loaded Company :',CompanyData.columns.values)
print('################################')
################################################################
## Assess Company Data
################################################################
print('################################')
print('Changed :',CompanyData.columns.values)
CompanyData.rename(columns={'Country': 'Country_Code'}, inplace=True)
print('To :',CompanyData.columns.values)
print('################################')
################################################################
################################################################
### Import Customer Data
################################################################
sFileName=Base + '/' + Company + '/' + sInputFileName3
print('################################')
print('Loading :',sFileName)
print('################################')
CustomerRawData=pd.read_csv(sFileName,header=0,low_memory=False, encoding="latin-1")
print('################################')
print('Loaded Customer :',CustomerRawData.columns.values)
print('################################')
################################################################
CustomerData=CustomerRawData.dropna(axis=0, how='any')
print('################################')
print('Remove Blank Country Code')
print('Reduce Rows from', CustomerRawData.shape[0],' to ', CustomerData.shape[0])
print('################################')
################################################################
print('################################')
print('Changed :',CustomerData.columns.values)
CustomerData.rename(columns={'Country': 'Country_Code'}, inplace=True)
print('To :',CustomerData.columns.values)
print('################################')
################################################################
print('################################')
print('Merge Company and Country Data')
print('################################')
CompanyNetworkData=pd.merge(CompanyData,
                            CountryData,
                            how='inner',
                            on='Country_Code')
################################################################
print('################################')
print('Change ',CompanyNetworkData.columns.values)
# Prefix every column with Company_ to mark its origin.
for i in CompanyNetworkData.columns.values:
    j='Company_'+i
    CompanyNetworkData.rename(columns={i: j}, inplace=True)
print('To ', CompanyNetworkData.columns.values)
print('################################')
################################################################
################################################################
sFileDir=Base + '/' + Company + '/02-Assess/01-EDS/02-Python'
if not os.path.exists(sFileDir):
    os.makedirs(sFileDir)
################################################################
sFileName=sFileDir + '/' + sOutputFileName
print('################################')
print('Storing :', sFileName)
print('################################')
CompanyNetworkData.to_csv(sFileName, index = False, encoding="latin-1")
################################################################
################################################################
print('################################')
print('### Done!! #####################')
print('################################')
################################################################

Output:
Go to the C:\VKHCG\01-Vermeulen\02-Assess\01-EDS\02-Python folder and open Assess-Network-Routing-Company.csv.
Next, assess the customers' locations using the network router locations.

3206##########Assess-Network-Routing-Customer.py ######################
3207
################################################################
import sys
import os
import pandas as pd
################################################################
pd.options.mode.chained_assignment = None
################################################################
if sys.platform == 'linux':
    Base=os.path.expanduser('~') + '/VKHCG'
else:
    Base='C:/VKHCG'
print('################################')
print('Working Base:',Base,'using',sys.platform)
print('################################')
################################################################
sInputFileName1='01-Retrieve/01-EDS/01-R/Retrieve_Country_Code.csv'
sInputFileName2='01-Retrieve/01-EDS/02-Python/Retrieve_All_Router_Location.csv'
################################################################
sOutputFileName='Assess-Network-Routing-Customer.csv'
Company='01-Vermeulen'
################################################################
################################################################
### Import Country Data
################################################################
sFileName=Base + '/' + Company + '/' + sInputFileName1
print('################################')
print('Loading:',sFileName)
print('################################')
CountryData=pd.read_csv(sFileName,header=0,low_memory=False, encoding="latin-1")
print('Loaded Country:',CountryData.columns.values)
print('################################')
################################################################
## Assess Country Data
################################################################
print('################################')
print('Changed:',CountryData.columns.values)
CountryData.rename(columns={'Country': 'Country_Name'}, inplace=True)
CountryData.rename(columns={'ISO-2-CODE': 'Country_Code'}, inplace=True)
CountryData.drop('ISO-M49', axis=1, inplace=True)
CountryData.drop('ISO-3-Code', axis=1, inplace=True)
CountryData.drop('RowID', axis=1, inplace=True)
print('To:',CountryData.columns.values)
print('################################')
################################################################
### Import Customer Data
################################################################
sFileName=Base + '/' + Company + '/' + sInputFileName2
print('################################')
print('Loading:',sFileName)
print('################################')
CustomerRawData=pd.read_csv(sFileName,header=0,low_memory=False, encoding="latin-1")
print('################################')
print('Loaded Customer:',CustomerRawData.columns.values)
print('################################')
################################################################
CustomerData=CustomerRawData.dropna(axis=0, how='any')
print('################################')
print('Remove Blank Country Code')
print('Reduce Rows from',CustomerRawData.shape[0],'to',CustomerData.shape[0])
print('################################')
################################################################
print('################################')
print('Changed:',CustomerData.columns.values)
CustomerData.rename(columns={'Country': 'Country_Code'}, inplace=True)
print('To:',CustomerData.columns.values)
print('################################')
################################################################
print('################################')
print('Merge Customer and Country Data')
print('################################')
CustomerNetworkData=pd.merge(
    CustomerData,
    CountryData,
    how='inner',
    on='Country_Code'
)
################################################################
print('################################')
print('Change',CustomerNetworkData.columns.values)
for i in CustomerNetworkData.columns.values:
    j='Customer_'+i
    CustomerNetworkData.rename(columns={i:j}, inplace=True)
print('To', CustomerNetworkData.columns.values)
print('################################')
################################################################
sFileDir=Base + '/' + Company + '/02-Assess/01-EDS/02-Python'
if not os.path.exists(sFileDir):
    os.makedirs(sFileDir)
################################################################
sFileName=sFileDir + '/' + sOutputFileName
print('################################')
print('Storing:', sFileName)
print('################################')
CustomerNetworkData.to_csv(sFileName, index = False, encoding="latin-1")
################################################################
print('################################')
print('### Done!! #####################')
print('################################')

Assess-Network-Routing-Node.py
################################################################
import sys
import os
import pandas as pd
################################################################
pd.options.mode.chained_assignment = None
################################################################
Base='C:/VKHCG'
################################################################
print('################################')
print('Working Base :',Base, ' using ', sys.platform)
print('################################')
################################################################
sInputFileName='01-Retrieve/01-EDS/02-Python/Retrieve_IP_DATA.csv'
################################################################
sOutputFileName='Assess-Network-Routing-Node.csv'
Company='01-Vermeulen'
################################################################
### Import IP Data
################################################################
sFileName=Base + '/' + Company + '/' + sInputFileName
print('################################')
print('Loading :',sFileName)
print('################################')
IPData=pd.read_csv(sFileName,header=0,low_memory=False, encoding="latin-1")
print('Loaded IP :', IPData.columns.values)
print('################################')
################################################################
print('################################')
print('Changed :',IPData.columns.values)
IPData.drop('RowID', axis=1, inplace=True)
IPData.drop('ID', axis=1, inplace=True)
IPData.rename(columns={'Country': 'Country_Code'}, inplace=True)
IPData.rename(columns={'Place.Name': 'Place_Name'}, inplace=True)
IPData.rename(columns={'Post.Code': 'Post_Code'}, inplace=True)
IPData.rename(columns={'First.IP.Number': 'First_IP_Number'}, inplace=True)
IPData.rename(columns={'Last.IP.Number': 'Last_IP_Number'}, inplace=True)
print('To :',IPData.columns.values)
print('################################')
################################################################
print('################################')
print('Change ',IPData.columns.values)
for i in IPData.columns.values:
    j='Node_'+i
    IPData.rename(columns={i: j}, inplace=True)
print('To ', IPData.columns.values)
print('################################')
################################################################
sFileDir=Base + '/' + Company + '/02-Assess/01-EDS/02-Python'
if not os.path.exists(sFileDir):
    os.makedirs(sFileDir)
################################################################
sFileName=sFileDir + '/' + sOutputFileName
print('################################')
print('Storing :', sFileName)
print('################################')
IPData.to_csv(sFileName, index = False, encoding="latin-1")
################################################################
print('################################')
print('### Done!! #####################')
print('################################')
################################################################

Output:
C:/VKHCG/01-Vermeulen/02-Assess/01-EDS/02-Python/Assess-Network-Routing-Node.csv

Directed Acyclic Graph (DAG)
A directed acyclic graph (DAG) is a directed graph that contains no directed cycles: following the edge directions, no path ever returns to its starting node.

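networkx provides a built-in test for this property. The sketch below is a stand-alone illustration with made-up nodes and edges, separate from the practical's script:

import networkx as nx

# One-way edges only: no path returns to its start, so this is a DAG.
DAG = nx.DiGraph()
DAG.add_edges_from([('US', 'DE'), ('DE', 'GB'), ('US', 'GB')])
print(nx.is_directed_acyclic_graph(DAG))   # True

# Adding the reverse edge creates the cycle US -> DE -> US.
DAG.add_edge('DE', 'US')
print(nx.is_directed_acyclic_graph(DAG))   # False

Note that nx.DiGraph only enforces edge direction, not acyclicity; it is up to the edges you add whether the result is a true DAG.
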
C. Write a Python / R program to build a directed acyclic graph.

Open your Python editor and create a file named Assess-DAG-Location.py in directory
C:\VKHCG\01-Vermeulen\02-Assess.
################################################################
import networkx as nx
import matplotlib.pyplot as plt
import sys
import os
import pandas as pd
################################################################
Base='C:/VKHCG'
################################################################
print('################################')
print('Working Base :',Base, ' using ', sys.platform)
print('################################')
################################################################
sInputFileName='01-Retrieve/01-EDS/02-Python/Retrieve_Router_Location.csv'
sOutputFileName1='Assess-DAG-Company-Country.png'
sOutputFileName2='Assess-DAG-Company-Country-Place.png'
Company='01-Vermeulen'
################################################################
### Import Company Data
################################################################
sFileName=Base + '/' + Company + '/' + sInputFileName
print('################################')
print('Loading :',sFileName)
print('################################')
CompanyData=pd.read_csv(sFileName,header=0,low_memory=False, encoding="latin-1")
print('Loaded Company :',CompanyData.columns.values)
print('################################')
################################################################
print(CompanyData)
print('################################')
print('Rows : ',CompanyData.shape[0])
print('################################')
################################################################
G1=nx.DiGraph()
G2=nx.DiGraph()
################################################################
# One node per country (G1) and one per place-country pair (G2).
for i in range(CompanyData.shape[0]):
    G1.add_node(CompanyData['Country'][i])
    sPlaceName= CompanyData['Place_Name'][i] + '-' + CompanyData['Country'][i]
    G2.add_node(sPlaceName)
print('################################')
# Link every pair of distinct country nodes.
for n1 in G1.nodes():
    for n2 in G1.nodes():
        if n1 != n2:
            print('Link :',n1,' to ', n2)
            G1.add_edge(n1,n2)
print('################################')
print('################################')
print("Nodes of graph: ")
print(G1.nodes())
print("Edges of graph: ")
print(G1.edges())
print('################################')
################################################################
sFileDir=Base + '/' + Company + '/02-Assess/01-EDS/02-Python'
if not os.path.exists(sFileDir):
    os.makedirs(sFileDir)
################################################################
sFileName=sFileDir + '/' + sOutputFileName1
print('################################')
print('Storing :', sFileName)
print('################################')
nx.draw(G1,pos=nx.spectral_layout(G1),
        node_color='r',edge_color='g',
        with_labels=True,node_size=8000,
        font_size=12)
plt.savefig(sFileName) # save as png
plt.show() # display
################################################################
print('################################')
for n1 in G2.nodes():
    for n2 in G2.nodes():
        if n1 != n2:
            print('Link :',n1,' to ', n2)
            G2.add_edge(n1,n2)
print('################################')
print('################################')
print("Nodes of graph: ")
print(G2.nodes())
print("Edges of graph: ")
print(G2.edges())
print('################################')
################################################################
sFileDir=Base + '/' + Company + '/02-Assess/01-EDS/02-Python'
if not os.path.exists(sFileDir):
    os.makedirs(sFileDir)
################################################################
sFileName=sFileDir + '/' + sOutputFileName2
print('################################')
print('Storing :', sFileName)
print('################################')
nx.draw(G2,pos=nx.spectral_layout(G2),
        node_color='r',edge_color='b',
        with_labels=True,node_size=8000,
        font_size=12)
plt.savefig(sFileName) # save as png
plt.show() # display
################################################################
Output:
################################
Rows : 150
################################
################################
Link : US to DE
Link : US to GB
Link : DE to US
Link : DE to GB
Link : GB to US
Link : GB to DE
################################
################################
Nodes of graph:
['US', 'DE', 'GB']
Edges of graph:
[('US', 'DE'), ('US', 'GB'), ('DE', 'US'), ('DE', 'GB'), ('GB', 'US'), ('GB', 'DE')]
################################

################################
Link : New York-US to Munich-DE
Link : New York-US to London-GB
Link : Munich-DE to New York-US
Link : Munich-DE to London-GB
Link : London-GB to New York-US
Link : London-GB to Munich-DE
################################
################################
Nodes of graph:
['New York-US', 'Munich-DE', 'London-GB']
Edges of graph:
[('New York-US', 'Munich-DE'), ('New York-US', 'London-GB'), ('Munich-DE', 'New York-US'),
('Munich-DE', 'London-GB'), ('London-GB', 'New York-US'), ('London-GB', 'Munich-DE')]

D. Write a Python program to create a complex graph.
Open your Python editor and create a file named Assess-DAG-GPS.py in directory
C:\VKHCG\01-Vermeulen\02-Assess.

import networkx as nx
import matplotlib.pyplot as plt
import sys
import os
import pandas as pd
Base='C:/VKHCG'
print('################################')
print('Working Base :',Base, ' using ', sys.platform)
print('################################')
sInputFileName='01-Retrieve/01-EDS/02-Python/Retrieve_Router_Location.csv'
sOutputFileName='Assess-DAG-Company-GPS.png'
Company='01-Vermeulen'
### Import Company Data
sFileName=Base + '/' + Company + '/' + sInputFileName
print('################################')
print('Loading :',sFileName)
print('################################')
CompanyData=pd.read_csv(sFileName,header=0,low_memory=False, encoding="latin-1")
print('Loaded Company :',CompanyData.columns.values)
print('################################')
print(CompanyData)
print('################################')
print('Rows : ',CompanyData.shape[0])
print('################################')
G=nx.Graph()
# One node per rounded GPS position, e.g. '40.75 N-73.97 W'.
for i in range(CompanyData.shape[0]):
    nLatitude=round(CompanyData['Latitude'][i],2)
    nLongitude=round(CompanyData['Longitude'][i],2)
    if nLatitude < 0:
        sLatitude = str(nLatitude*-1) + ' S'
    else:
        sLatitude = str(nLatitude) + ' N'
    if nLongitude < 0:
        sLongitude = str(nLongitude*-1) + ' W'
    else:
        sLongitude = str(nLongitude) + ' E'
    sGPS= sLatitude + '-' + sLongitude
    G.add_node(sGPS)
print('################################')
for n1 in G.nodes():
    for n2 in G.nodes():
        if n1 != n2:
            print('Link :',n1,' to ', n2)
            G.add_edge(n1,n2)
print('################################')
print('################################')
print("Nodes of graph: ")
print(G.number_of_nodes())
print("Edges of graph: ")
print(G.number_of_edges())
print('################################')
Output:
=== RESTART: C:\VKHCG\01-Vermeulen\02-Assess\Assess-DAG-GPS-unsmoothed.py ===
################################
Working Base : C:/VKHCG using win32
################################
Loading : C:/VKHCG/01-Vermeulen/01-Retrieve/01-EDS/02-Python/Retrieve_Router_Location.csv
################################
Loaded Company : ['Country' 'Place_Name' 'Latitude' 'Longitude']
################################
  Country Place_Name Latitude Longitude
0      US   New York  40.7528  -73.9725
1      US   New York  40.7214  -74.0052
...
Link : 48.15 N-11.74 E to 48.15 N-11.46 E
Link : 48.15 N-11.74 E to 48.09 N-11.54 E
Link : 48.15 N-11.74 E to 48.18 N-11.75 E
Link : 48.15 N-11.74 E to 48.1 N-11.47 E
################################
Nodes of graph:
117
Edges of graph:
6786
################################
>>>

KRENNWALLNER

E. Write a Python / R program to pick the content for billboards from the given data and calculate
the number of visitors per day from the range of IP addresses that access the billboards in
Germany.

Picking Content for Billboards
The basic process required is to combine two sets of data and then calculate the number of visitors
per day from the range of IP addresses that access the billboards in Germany.
Billboard Location: Rows - 8,873
Access Visitors: Rows - 75,999
Access Location Record: Rows - 181,235
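The VisitorYearRate column computed in the SQL below multiplies the size of each visitor's IP range by 365.25 * 24 * 12, which amounts to assuming every address in the range can reconnect once every five minutes, around the clock, for a year. A stand-alone check of the arithmetic, using a made-up range of 255 addresses:

# Hypothetical check of the VisitorYearRate formula used in the SQL below.
first_ip_number = 0
last_ip_number = 255   # a made-up range of 255 addresses
rate = (last_ip_number - first_ip_number) * 365.25 * 24 * 12
print(rate)            # 26823960.0, matching the first row of the sample output
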
Open your Python editor and create a file named Assess-DE-Billboard.py in directory
C:\VKHCG\02-Krennwallner\02-Assess.
################# Assess-DE-Billboard.py ######################
import sys
import os
import sqlite3 as sq
import pandas as pd
################################################################
Base='C:/VKHCG'
print('################################')
print('Working Base :',Base, ' using ', sys.platform)
print('################################')
################################################################
sInputFileName1='01-Retrieve/01-EDS/02-Python/Retrieve_DE_Billboard_Locations.csv'
sInputFileName2='01-Retrieve/01-EDS/02-Python/Retrieve_Online_Visitor.csv'
sOutputFileName='Assess-DE-Billboard-Visitor.csv'
Company='02-Krennwallner'
################################################################
sDataBaseDir=Base + '/' + Company + '/02-Assess/SQLite'
if not os.path.exists(sDataBaseDir):
    os.makedirs(sDataBaseDir)
################################################################
sDatabaseName=sDataBaseDir + '/krennwallner.db'
conn = sq.connect(sDatabaseName)
################################################################
### Import Billboard Data
################################################################
sFileName=Base + '/' + Company + '/' + sInputFileName1
print('################################')
print('Loading :',sFileName)
print('################################')
BillboardRawData=pd.read_csv(sFileName,header=0,low_memory=False, encoding="latin-1")
BillboardRawData.drop_duplicates(subset=None, keep='first', inplace=True)
BillboardData=BillboardRawData
print('Loaded Company :',BillboardData.columns.values)
print('################################')
################################################################
print('################')
sTable='Assess_BillboardData'
print('Storing :',sDatabaseName,' Table:',sTable)
BillboardData.to_sql(sTable, conn, if_exists="replace")
print('################')
################################################################
print(BillboardData.head())
print('################################')
print('Rows : ',BillboardData.shape[0])
print('################################')
################################################################
### Import Visitor Data
################################################################
sFileName=Base + '/' + Company + '/' + sInputFileName2
print('################################')
print('Loading :',sFileName)
print('################################')
VisitorRawData=pd.read_csv(sFileName,header=0,low_memory=False, encoding="latin-1")
VisitorRawData.drop_duplicates(subset=None, keep='first', inplace=True)
VisitorData=VisitorRawData[VisitorRawData.Country=='DE']
print('Loaded Company :',VisitorData.columns.values)
print('################################')
################################################################
print('################')
sTable='Assess_VisitorData'
print('Storing :',sDatabaseName,' Table:',sTable)
VisitorData.to_sql(sTable, conn, if_exists="replace")
print('################')
################################################################
print(VisitorData.head())
print('################################')
print('Rows : ',VisitorData.shape[0])
print('################################')
################################################################
print('################')
sTable='Assess_BillboardVisitorData'
print('Loading :',sDatabaseName,' Table:',sTable)
sSQL="select distinct"
sSQL=sSQL+ " A.Country AS BillboardCountry,"
sSQL=sSQL+ " A.Place_Name AS BillboardPlaceName,"
sSQL=sSQL+ " A.Latitude AS BillboardLatitude,"
sSQL=sSQL+ " A.Longitude AS BillboardLongitude,"
sSQL=sSQL+ " B.Country AS VisitorCountry,"
sSQL=sSQL+ " B.Place_Name AS VisitorPlaceName,"
sSQL=sSQL+ " B.Latitude AS VisitorLatitude,"
sSQL=sSQL+ " B.Longitude AS VisitorLongitude,"
sSQL=sSQL+ " (B.Last_IP_Number - B.First_IP_Number) * 365.25 * 24 * 12 AS VisitorYearRate"
sSQL=sSQL+ " from"
sSQL=sSQL+ " Assess_BillboardData as A"
sSQL=sSQL+ " JOIN"
sSQL=sSQL+ " Assess_VisitorData as B"
sSQL=sSQL+ " ON"
sSQL=sSQL+ " A.Country = B.Country"
sSQL=sSQL+ " AND"
sSQL=sSQL+ " A.Place_Name = B.Place_Name;"
BillboardVistorsData=pd.read_sql_query(sSQL, conn)
print('################')
################################################################
print('################')
sTable='Assess_BillboardVistorsData'
print('Storing :',sDatabaseName,' Table:',sTable)
BillboardVistorsData.to_sql(sTable, conn, if_exists="replace")
print('################')
################################################################
print(BillboardVistorsData.head())
print('################################')
print('Rows : ',BillboardVistorsData.shape[0])
print('################################')
################################################################
sFileDir=Base + '/' + Company + '/02-Assess/01-EDS/02-Python'
if not os.path.exists(sFileDir):
    os.makedirs(sFileDir)
################################################################
# Build the output file name before printing and storing it.
sFileName=sFileDir + '/' + sOutputFileName
print('################################')
print('Storing :', sFileName)
print('################################')
BillboardVistorsData.to_csv(sFileName, index = False)
print('################################')
################################################################
print('### Done!! ############################################')

Output:
C:\VKHCG\02-Krennwallner\01-Retrieve\01-EDS\02-Python\Retrieve_Online_Visitor.csv
contains 1,048,576 rows.

SQLite Visitor's Database
C:/VKHCG/02-Krennwallner/02-Assess/SQLite/krennwallner.db Table:
  BillboardCountry BillboardPlaceName ... VisitorLongitude VisitorYearRate
0               DE               Lake ...           8.5667      26823960.0
1               DE               Horb ...           8.6833      26823960.0
2               DE               Horb ...           8.6833      53753112.0
3               DE               Horb ...           8.6833     107611416.0
4               DE               Horb ...           8.6833      13359384.0

HILLMAN

F. Write a Python / R program to plan the locations of the warehouses from the given data.
Planning the Locations of the Warehouses
Planning the locations of the warehouses requires assessing the GPS positions of these warehouses
against Hillman's logistics requirements.
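The script below relies on geopy's Nominatim service to turn a latitude/longitude pair into a street address (reverse geocoding). A minimal stand-alone sketch, assuming network access; the user_agent value is just an example identifier of your own choosing (newer geopy releases require one):

from geopy.geocoders import Nominatim

# 'warehouse-practical' is a hypothetical user_agent string.
geolocator = Nominatim(user_agent='warehouse-practical')
location = geolocator.reverse('51.5074,-0.1278')   # central London
print(location.address)
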
Open your editor and create a file named Assess-Warehouse-Address.py in directory
C:\VKHCG\03-Hillman\02-Assess.
################## Assess-Warehouse-Address.py ###################
# -*- coding: utf-8 -*-
################################################################
import os
import pandas as pd
from geopy.geocoders import Nominatim
# Newer geopy releases require an identifying user_agent string.
geolocator = Nominatim(user_agent='vkhcg-practical')
################################################################
InputDir='01-Retrieve/01-EDS/01-R'
InputFileName='Retrieve_GB_Postcode_Warehouse.csv'
EDSDir='02-Assess/01-EDS'
OutputDir=EDSDir + '/02-Python'
OutputFileName='Assess_GB_Warehouse_Address.csv'
Company='03-Hillman'
################################################################
Base='C:/VKHCG'
print('################################')
print('Working Base :',Base, ' using Windows')
print('################################')
################################################################
sFileDir=Base + '/' + Company + '/' + EDSDir
if not os.path.exists(sFileDir):
    os.makedirs(sFileDir)
################################################################
sFileDir=Base + '/' + Company + '/' + OutputDir
if not os.path.exists(sFileDir):
    os.makedirs(sFileDir)
################################################################
sFileName=Base + '/' + Company + '/' + InputDir + '/' + InputFileName
print('###########')
print('Loading :',sFileName)
Warehouse=pd.read_csv(sFileName,header=0,low_memory=False)
Warehouse.sort_values(by='postcode', ascending=True, inplace=True)
################################################################
## Limited to 10 due to service limit on Address Service.
################################################################
WarehouseGoodHead=Warehouse[Warehouse.latitude != 0].head(5).copy()
WarehouseGoodTail=Warehouse[Warehouse.latitude != 0].tail(5).copy()
################################################################
WarehouseGoodHead['Warehouse_Point']=WarehouseGoodHead.apply(lambda row:
    (str(row['latitude'])+','+str(row['longitude'])),axis=1)
WarehouseGoodHead['Warehouse_Address']=WarehouseGoodHead.apply(lambda row:
    geolocator.reverse(row['Warehouse_Point']).address,axis=1)
WarehouseGoodHead.drop('Warehouse_Point', axis=1, inplace=True)
WarehouseGoodHead.drop('id', axis=1, inplace=True)
WarehouseGoodHead.drop('postcode', axis=1, inplace=True)
################################################################
WarehouseGoodTail['Warehouse_Point']=WarehouseGoodTail.apply(lambda row:
    (str(row['latitude'])+','+str(row['longitude'])),axis=1)
WarehouseGoodTail['Warehouse_Address']=WarehouseGoodTail.apply(lambda row:
    geolocator.reverse(row['Warehouse_Point']).address,axis=1)
WarehouseGoodTail.drop('Warehouse_Point', axis=1, inplace=True)
WarehouseGoodTail.drop('id', axis=1, inplace=True)
WarehouseGoodTail.drop('postcode', axis=1, inplace=True)
################################################################
# concat is the non-deprecated equivalent of DataFrame.append here.
WarehouseGood=pd.concat([WarehouseGoodHead, WarehouseGoodTail], ignore_index=True)
print(WarehouseGood)
################################################################
sFileName=sFileDir + '/' + OutputFileName
WarehouseGood.to_csv(sFileName, index = False)
#################################################################
print('### Done!! ############################################')
#################################################################
Output:

G. Write a Python / R program for the new global warehouses:

Hillman wants to add extra global warehouses, and you are required to assess where they should be located. We only have to collect the possible locations for the warehouses.
The following example shows how to rename input data columns whose names are completely ambiguous.

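A rename of this kind is a one-line pandas operation; the miniature below, with a made-up two-column frame, shows the pattern the script applies to X1 through X11:

import pandas as pd

# Hypothetical miniature of the column rename performed below.
df = pd.DataFrame({'X1': ['GB'], 'X2': ['EC1A']})
df.rename(columns={'X1': 'Country', 'X2': 'PostCode'}, inplace=True)
print(df.columns.values)   # ['Country' 'PostCode']
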
Open your Python editor and create a file named Assess-Warehouse-Global.py in directory
C:\VKHCG\03-Hillman\02-Assess.
################# Assess-Warehouse-Global.py ##############
# -*- coding: utf-8 -*-
################################################################
import sys
import os
import pandas as pd
################################################################
Base='C:/VKHCG'
################################################################
print('################################')
print('Working Base :',Base, ' using ', sys.platform)
print('################################')
################################################################
Company='03-Hillman'
InputDir='01-Retrieve/01-EDS/01-R'
InputFileName='Retrieve_All_Countries.csv'
EDSDir='02-Assess/01-EDS'
OutputDir=EDSDir + '/02-Python'
OutputFileName='Assess_All_Warehouse.csv'
################################################################
sFileDir=Base + '/' + Company + '/' + EDSDir
if not os.path.exists(sFileDir):
    os.makedirs(sFileDir)
################################################################
sFileDir=Base + '/' + Company + '/' + OutputDir
if not os.path.exists(sFileDir):
    os.makedirs(sFileDir)
################################################################
sFileName=Base + '/' + Company + '/' + InputDir + '/' + InputFileName
print('###########')
print('Loading :',sFileName)
Warehouse=pd.read_csv(sFileName,header=0,low_memory=False, encoding="latin-1")
################################################################
# Map the ambiguous X-columns onto meaningful names.
sColumns={'X1' : 'Country',
          'X2' : 'PostCode',
          'X3' : 'PlaceName',
          'X4' : 'AreaName',
          'X5' : 'AreaCode',
          'X10' : 'Latitude',
          'X11' : 'Longitude'}
Warehouse.rename(columns=sColumns,inplace=True)
WarehouseGood=Warehouse
################################################################
sFileName=sFileDir + '/' + OutputFileName
WarehouseGood.to_csv(sFileName, index = False)
#################################################################
print('### Done!! ############################################')
#################################################################
This will produce a set of demonstrated values onscreen, plus a data file named
Assess_All_Warehouse.csv.
Output:

Open Assess_All_Warehouse.csv from C:\VKHCG\03-Hillman\02-Assess\01-EDS\02-Python.

Clark Ltd
Clark Ltd is the accountancy company that handles everything related to VKHCG's finances and
personnel. Let's investigate Clark with our new knowledge.
K. Write a Python program to create a simple forex trading planner from the given data.
Simple Forex Trading Planner
Clark requires an assessment of the group's forex data, for processing and data quality issues. I will guide you
through an example of a forex solution.
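The script assembles its per-currency SQL by string concatenation. Where the filter value comes from data, a parameterised query is a safer equivalent; a hedged sketch (the database path is the one this practical uses, and 'EUR' is an example code):

import sqlite3 as sq
import pandas as pd

# Hypothetical parameterised version of the per-currency filter used below;
# the '?' placeholder avoids quoting mistakes in the WHERE clause.
conn = sq.connect('C:/VKHCG/04-Clark/02-Assess/SQLite/clark.db')
sSQL = "SELECT * FROM Assess_Forex WHERE CodeIn = ?;"
print(pd.read_sql_query(sSQL, conn, params=('EUR',)))
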
Open your Python editor and create a file named Assess-Forex.py in directory
C:\VKHCG\04-Clark\02-Assess.
################################################################
import sys
import os
import sqlite3 as sq
import pandas as pd
################################################################
Base='C:/VKHCG'
print('################################')
print('Working Base :',Base, ' using ', sys.platform)
print('################################')
################################################################
Company='04-Clark'
sInputFileName1='01-Vermeulen/01-Retrieve/01-EDS/02-Python/Retrieve-Country-Currency.csv'
sInputFileName2='04-Clark/01-Retrieve/01-EDS/01-R/Retrieve_Euro_EchangeRates.csv'
################################################################
sDataBaseDir=Base + '/' + Company + '/02-Assess/SQLite'
if not os.path.exists(sDataBaseDir):
    os.makedirs(sDataBaseDir)
################################################################
sDatabaseName=sDataBaseDir + '/clark.db'
conn = sq.connect(sDatabaseName)
################################################################
### Import Country Data
################################################################
sFileName1=Base + '/' + sInputFileName1
print('################################')
print('Loading :',sFileName1)
print('################################')
CountryRawData=pd.read_csv(sFileName1,header=0,low_memory=False, encoding="latin-1")
CountryRawData.drop_duplicates(subset=None, keep='first', inplace=True)
CountryData=CountryRawData
print('Loaded Company :',CountryData.columns.values)
print('################################')
################################################################
print('################')
sTable='Assess_Country'
print('Storing :',sDatabaseName,' Table:',sTable)
CountryData.to_sql(sTable, conn, if_exists="replace")
print('################')
################################################################
print(CountryData.head())
print('################################')
print('Rows : ',CountryData.shape[0])
print('################################')
################################################################
### Import Forex Data
################################################################
sFileName2=Base + '/' + sInputFileName2
print('################################')
print('Loading :',sFileName2)
print('################################')
ForexRawData=pd.read_csv(sFileName2,header=0,low_memory=False, encoding="latin-1")
ForexRawData.drop_duplicates(subset=None, keep='first', inplace=True)
ForexData=ForexRawData.head(5)
print('Loaded Company :',ForexData.columns.values)
print('################################')
################################################################
print('################')
sTable='Assess_Forex'
print('Storing :',sDatabaseName,' Table:',sTable)
ForexData.to_sql(sTable, conn, if_exists="replace")
print('################')
################################################################
print(ForexData.head())
print('################################')
print('Rows : ',ForexData.shape[0])
print('################################')
################################################################
print('################')
sTable='Assess_Forex'
print('Loading :',sDatabaseName,' Table:',sTable)
sSQL="select distinct"
sSQL=sSQL+ " A.CodeIn"
sSQL=sSQL+ " from"
sSQL=sSQL+ " Assess_Forex as A;"
CodeData=pd.read_sql_query(sSQL, conn)
print('################')
################################################################
# Build one joined table per distinct CodeIn currency.
for c in range(CodeData.shape[0]):
    print('################')
    sTable='Assess_Forex & 2x Country > ' + CodeData['CodeIn'][c]
    print('Loading :',sDatabaseName,' Table:',sTable)
    sSQL="select distinct"
    sSQL=sSQL+ " A.Date,"
    sSQL=sSQL+ " A.CodeIn,"
    sSQL=sSQL+ " B.Country as CountryIn,"
    sSQL=sSQL+ " B.Currency as CurrencyNameIn,"
    sSQL=sSQL+ " A.CodeOut,"
    sSQL=sSQL+ " C.Country as CountryOut,"
    sSQL=sSQL+ " C.Currency as CurrencyNameOut,"
    sSQL=sSQL+ " A.Rate"
    sSQL=sSQL+ " from"
    sSQL=sSQL+ " Assess_Forex as A"
    sSQL=sSQL+ " JOIN"
    sSQL=sSQL+ " Assess_Country as B"
    sSQL=sSQL+ " ON A.CodeIn = B.CurrencyCode"
    sSQL=sSQL+ " JOIN"
    sSQL=sSQL+ " Assess_Country as C"
    sSQL=sSQL+ " ON A.CodeOut = C.CurrencyCode"
    sSQL=sSQL+ " WHERE"
    sSQL=sSQL+ " A.CodeIn ='" + CodeData['CodeIn'][c] + "';"
    ForexData=pd.read_sql_query(sSQL, conn).head(1000)
    print('################')
    print(ForexData)
    print('################')
    sTable='Assess_Forex_' + CodeData['CodeIn'][c]
    print('Storing :',sDatabaseName,' Table:',sTable)
    ForexData.to_sql(sTable, conn, if_exists="replace")
    print('################')
    print('################################')
    print('Rows : ',ForexData.shape[0])
    print('################################')
################################################################
print('### Done!! ############################################')
################################################################
Output:
This will produce a set of demonstrated values onscreen, after removing duplicate records and
performing the related data processing.

L. Write a Python program to process the balance sheet to ensure that only good data is
processed.
Financials
Clark requires you to process the balance sheet for the VKHCG group companies. Go through a sample
balance-sheet data assessment to ensure that only the good data is processed.
Open your Python editor and create a file named Assess-Financials.py in directory
C:\VKHCG\04-Clark\02-Assess.
################################################################
import sys
import os
import sqlite3 as sq
import pandas as pd
################################################################
if sys.platform == 'linux':
    Base=os.path.expanduser('~') + '/VKHCG'
else:
    Base='C:/VKHCG'
print('################################')
print('Working Base :',Base, ' using ', sys.platform)
print('################################')
###############################################################
Company='04-Clark'
sInputFileName='01-Retrieve/01-EDS/01-R/Retrieve_Profit_And_Loss.csv'
################################################################
sDataBaseDir=Base + '/' + Company + '/02-Assess/SQLite'
if not os.path.exists(sDataBaseDir):
    os.makedirs(sDataBaseDir)
################################################################
sDatabaseName=sDataBaseDir + '/clark.db'
conn = sq.connect(sDatabaseName)
################################################################
### Import Financial Data
################################################################
sFileName=Base + '/' + Company + '/' + sInputFileName
print('################################')
print('Loading :',sFileName)
print('################################')
FinancialRawData=pd.read_csv(sFileName,header=0,low_memory=False, encoding="latin-1")
FinancialData=FinancialRawData
print('Loaded Company :',FinancialData.columns.values)
print('################################')
################################################################
print('################')
sTable='Assess-Financials'
print('Storing :',sDatabaseName,' Table:',sTable)
FinancialData.to_sql(sTable, conn, if_exists="replace")
print('################')
################################################################
print(FinancialData.head())
print('################################')
print('Rows : ',FinancialData.shape[0])
print('################################')
################################################################
################################################################
print('### Done!! ############################################')
################################################################
Write a Python program to store all master records for the financial calendar.
Financial Calendar
Clark stores all the master records for the financial calendar, so we import the calendar from the
retrieve step's data storage.
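The only data-quality rule applied in this step is de-duplication on the FinDate key; the miniature below, with made-up rows, shows what drop_duplicates keeps:

import pandas as pd

# Hypothetical miniature of the FinDate de-duplication performed below.
df = pd.DataFrame({'FinDate': ['2018-01-01', '2018-01-01', '2018-01-02'],
                   'Value': [1, 2, 3]})
df.drop_duplicates(subset='FinDate', keep='first', inplace=True)
print(df)   # the first 2018-01-01 row and the 2018-01-02 row survive
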
Open your Python editor and create a file named Assess-Calendar.py in directory
C:\VKHCG\04-Clark\02-Assess.
################################################################
import sys
import os
import sqlite3 as sq
import pandas as pd
################################################################
if sys.platform == 'linux':
    Base=os.path.expanduser('~') + '/VKHCG'
else:
    Base='C:/VKHCG'
print('################################')
print('Working Base :',Base, ' using ', sys.platform)
print('################################')
################################################################
Company='04-Clark'
################################################################
sDataBaseDirIn=Base + '/' + Company + '/01-Retrieve/SQLite'
if not os.path.exists(sDataBaseDirIn):
    os.makedirs(sDataBaseDirIn)
sDatabaseNameIn=sDataBaseDirIn + '/clark.db'
connIn = sq.connect(sDatabaseNameIn)
################################################################
# The assessed output goes to the Assess data store.
sDataBaseDirOut=Base + '/' + Company + '/02-Assess/SQLite'
if not os.path.exists(sDataBaseDirOut):
    os.makedirs(sDataBaseDirOut)
sDatabaseNameOut=sDataBaseDirOut + '/clark.db'
connOut = sq.connect(sDatabaseNameOut)
################################################################
sTableIn='Retrieve_Date'
sSQL='select * FROM ' + sTableIn + ';'
print('################')
print('Loading :',sDatabaseNameIn,' Table:',sTableIn)
dateRawData=pd.read_sql_query(sSQL, connIn)
dateData=dateRawData
################################################################
print('################################')
print('Load Rows : ',dateRawData.shape[0], ' records')
print('################################')
dateData.drop_duplicates(subset='FinDate', keep='first', inplace=True)
################################################################
print('################')
sTableOut='Assess_Date'
print('Storing :',sDatabaseNameOut,' Table:',sTableOut)
dateData.to_sql(sTableOut, connOut, if_exists="replace")
print('################')
################################################################
print('################################')
print('Store Rows : ',dateData.shape[0], ' records')
print('################################')
################################################################
################################################################
sTableIn='Retrieve_Time'
sSQL='select * FROM ' + sTableIn + ';'
print('################')
print('Loading :',sDatabaseNameIn,' Table:',sTableIn)
timeRawData=pd.read_sql_query(sSQL, connIn)
timeData=timeRawData
################################################################
print('################################')
print('Load Rows : ',timeData.shape[0], ' records')
print('################################')
timeData.drop_duplicates(subset=None, keep='first', inplace=True)
################################################################
print('################')
sTableOut='Assess_Time'
print('Storing :',sDatabaseNameOut,' Table:',sTableOut)
timeData.to_sql(sTableOut, connOut, if_exists="replace")
print('################')
################################################################
print('################################')
print('Store Rows : ',timeData.shape[0], ' records')
print('################################')
################################################################
print('### Done!! ############################################')
################################################################
M. Write a Python program to generate payroll from the given data.

People
Clark Ltd generates the payroll, so it holds all the staff records. Clark also handles all payments to
suppliers and receives payments from customers for all the group's companies.
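The script synthesises people records by pairing sampled first names with surnames through SQL joins. The pandas miniature below, with made-up names and a constant-key trick instead of SQL, shows the same cross-join idea:

import pandas as pd

# Hypothetical cross join: every first name meets every last name.
first = pd.DataFrame({'FirstName': ['Amy', 'Ben'], 'key': 1})
last = pd.DataFrame({'LastName': ['Smith', 'Jones'], 'key': 1})
people = pd.merge(first, last, on='key').drop('key', axis=1)
print(people)   # four synthetic name combinations
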
Open your Python editor and create a file named Assess-People.py in directory
C:\VKHCG\04-Clark\02-Assess.
################################################################
import sys
import os
import sqlite3 as sq
import pandas as pd
################################################################
Base='C:/VKHCG'
print('################################')
print('Working Base :',Base, ' using ', sys.platform)
print('################################')
################################################################
Company='04-Clark'
sInputFileName1='01-Retrieve/01-EDS/02-Python/Retrieve-Data_female-names.csv'
sInputFileName2='01-Retrieve/01-EDS/02-Python/Retrieve-Data_male-names.csv'
sInputFileName3='01-Retrieve/01-EDS/02-Python/Retrieve-Data_last-names.csv'
sOutputFileName1='Assess-Staff.csv'
sOutputFileName2='Assess-Customers.csv'
################################################################
sDataBaseDir=Base + '/' + Company + '/02-Assess/SQLite'
if not os.path.exists(sDataBaseDir):
    os.makedirs(sDataBaseDir)
################################################################
sDatabaseName=sDataBaseDir + '/clark.db'
conn = sq.connect(sDatabaseName)
################################################################
### Import Female Data
################################################################
sFileName=Base + '/' + Company + '/' + sInputFileName1
print('################################')
print('Loading :',sFileName)
print('################################')
FemaleRawData=pd.read_csv(sFileName,header=0,low_memory=False, encoding="latin-1")
FemaleRawData.rename(columns={'NameValues' : 'FirstName'},inplace=True)
FemaleRawData.drop_duplicates(subset=None, keep='first', inplace=True)
FemaleData=FemaleRawData.sample(100)
print('################################')
################################################################
print('################')
sTable='Assess_FemaleName'
print('Storing :',sDatabaseName,' Table:',sTable)
FemaleData.to_sql(sTable, conn, if_exists="replace")
print('################')
print('################################')
print('Rows : ',FemaleData.shape[0], ' records')
print('################################')
################################################################
### Import Male Data
sFileName=Base + '/' + Company + '/' + sInputFileName2
print('################################')
print('Loading :',sFileName)
print('################################')
MaleRawData=pd.read_csv(sFileName,header=0,low_memory=False, encoding="latin-1")
MaleRawData.rename(columns={'NameValues' : 'FirstName'},inplace=True)
MaleRawData.drop_duplicates(subset=None, keep='first', inplace=True)
MaleData=MaleRawData.sample(100)
print('################################')
sTable='Assess_MaleName'
print('Storing :',sDatabaseName,' Table:',sTable)
MaleData.to_sql(sTable, conn, if_exists="replace")
print('################')
################################################################
print('################################')
print('Rows : ',MaleData.shape[0], ' records')
print('################################')
################################################################
### Import Surname Data
sFileName=Base + '/' + Company + '/' + sInputFileName3
print('################################')
print('Loading :',sFileName)
print('################################')
SurnameRawData=pd.read_csv(sFileName,header=0,low_memory=False, encoding="latin-1")
SurnameRawData.rename(columns={'NameValues' : 'LastName'},inplace=True)
SurnameRawData.drop_duplicates(subset=None, keep='first', inplace=True)
SurnameData=SurnameRawData.sample(200)
print('################')
sTable='Assess_Surname'
print('Storing :',sDatabaseName,' Table:',sTable)
SurnameData.to_sql(sTable, conn, if_exists="replace")
print('################')
print('################################')
print('Rows : ',SurnameData.shape[0], ' records')
print('################################')
print('################')
# Combine the female and male samples into one first-name list with gender.
sTable='Assess_FemaleName & Assess_MaleName'
print('Loading :',sDatabaseName,' Table:',sTable)
sSQL="select distinct"
sSQL=sSQL+ " A.FirstName,"
sSQL=sSQL+ " 'Female' as Gender"
sSQL=sSQL+ " from"
sSQL=sSQL+ " Assess_FemaleName as A"
sSQL=sSQL+ " UNION"
sSQL=sSQL+ " select distinct"
sSQL=sSQL+ " A.FirstName,"
sSQL=sSQL+ " 'Male' as Gender"
sSQL=sSQL+ " from"
sSQL=sSQL+ " Assess_MaleName as A;"
FirstNameData=pd.read_sql_query(sSQL, conn)
print('################')
#################################################################
sTable='Assess_FirstName'
print('Storing :',sDatabaseName,' Table:',sTable)
FirstNameData.to_sql(sTable, conn, if_exists="replace")
print('################')
################################################################
################################################################
print('################')
# Cross-join first names (as first and second name) with the surnames.
sTable='Assess_FirstName x2 & Assess_Surname'
print('Loading :',sDatabaseName,' Table:',sTable)
sSQL="select distinct"
sSQL=sSQL+ " A.FirstName,"
sSQL=sSQL+ " B.FirstName AS SecondName,"
sSQL=sSQL+ " C.LastName,"
sSQL=sSQL+ " A.Gender"
sSQL=sSQL+ " from"
sSQL=sSQL+ " Assess_FirstName as A"
sSQL=sSQL+ " ,"
sSQL=sSQL+ " Assess_FirstName as B"
sSQL=sSQL+ " ,"
sSQL=sSQL+ " Assess_Surname as C"
sSQL=sSQL+ " WHERE"
sSQL=sSQL+ " A.Gender = B.Gender"
sSQL=sSQL+ " AND"
sSQL=sSQL+ " A.FirstName <> B.FirstName;"
PeopleRawData=pd.read_sql_query(sSQL, conn)
People1Data=PeopleRawData.sample(10000)
sTable='Assess_FirstName & Assess_Surname'
print('Loading :',sDatabaseName,' Table:',sTable)
sSQL="select distinct"
sSQL=sSQL+ " A.FirstName,"
sSQL=sSQL+ " '' AS SecondName,"
sSQL=sSQL+ " B.LastName,"
sSQL=sSQL+ " A.Gender"
sSQL=sSQL+ " from"
sSQL=sSQL+ " Assess_FirstName as A"
sSQL=sSQL+ " ,"
sSQL=sSQL+ " Assess_Surname as B;"
PeopleRawData=pd.read_sql_query(sSQL, conn)
People2Data=PeopleRawData.sample(10000)
# concat is the non-deprecated equivalent of DataFrame.append here.
PeopleData=pd.concat([People1Data, People2Data])
print(PeopleData)
print('################')
#################################################################
sTable='Assess_People'
print('Storing :',sDatabaseName,' Table:',sTable)
PeopleData.to_sql(sTable, conn, if_exists="replace")
print('################')
################################################################
sFileDir=Base + '/' + Company + '/02-Assess/01-EDS/02-Python'
if not os.path.exists(sFileDir):
    os.makedirs(sFileDir)
################################################################
sOutputFileName = sTable+'.csv'
sFileName=sFileDir + '/' + sOutputFileName
print('################################')
print('Storing :', sFileName)
print('################################')
PeopleData.to_csv(sFileName, index = False)
print('################################')
################################################################
print('### Done!! ############################################')
################################################################
OUTPUT:

Practical 6

Aim:Processing Data

A. Build the time hub, links, and satellites.

Open your Python editor and create a file named Process_Time.py.
Save it into directory C:\VKHCG\01-Vermeulen\03-Process.
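The heart of the script is converting each UTC timestamp into every time zone pytz knows about. A stand-alone sketch of a single such conversion, with Asia/Tokyo chosen as an arbitrary example zone:

from datetime import datetime
from pytz import timezone

# Hypothetical single-zone version of the conversion the script below
# repeats for every entry in pytz's all_timezones list.
now_utc = datetime(2018, 1, 1, 0, 0, 0, tzinfo=timezone('UTC'))
now_tokyo = now_utc.astimezone(timezone('Asia/Tokyo'))
print(now_tokyo.strftime('%Y-%m-%d %H:%M:%S'))   # 2018-01-01 09:00:00
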
################################################################
# -*- coding: utf-8 -*-
################################################################
import sys
import os
from datetime import datetime
from datetime import timedelta
from pytz import timezone, all_timezones
import pandas as pd
import sqlite3 as sq
import uuid
pd.options.mode.chained_assignment = None
################################################################
if sys.platform == 'linux':
    Base=os.path.expanduser('~') + '/VKHCG'
else:
    Base='C:/VKHCG'
print('################################')
print('Working Base :',Base, ' using ', sys.platform)
print('################################')
################################################################
Company='01-Vermeulen'
InputDir='00-RawData'
InputFileName='VehicleData.csv'
################################################################
sDataBaseDir=Base + '/' + Company + '/03-Process/SQLite'
if not os.path.exists(sDataBaseDir):
    os.makedirs(sDataBaseDir)
################################################################
sDatabaseName=sDataBaseDir + '/Hillman.db'
conn1 = sq.connect(sDatabaseName)
################################################################
sDataVaultDir=Base + '/88-DV'
if not os.path.exists(sDataVaultDir):
    os.makedirs(sDataVaultDir)
################################################################
sDatabaseName=sDataVaultDir + '/datavault.db'
conn2 = sq.connect(sDatabaseName)
################################################################
# You will set up a time hub for a period of ten years before January 1,
# 2018. If you want to experiment with different periods, simply change
# the parameters, e.g. numUnits=5*365*24 for five years of hourly records.
base = datetime(2018,1,1,0,0,0)
numUnits=10*365*24
################################################################
date_list = [base - timedelta(hours=x) for x in range(0, numUnits)]
t=0
for i in date_list:
    now_utc=i.replace(tzinfo=timezone('UTC'))
    sDateTime=now_utc.strftime("%Y-%m-%d %H:%M:%S")
    print(sDateTime)
    sDateTimeKey=sDateTime.replace(' ','-').replace(':','-')
    t+=1
    IDNumber=str(uuid.uuid4())
    TimeLine=[('ZoneBaseKey', ['UTC']),
              ('IDNumber', [IDNumber]),
              ('nDateTimeValue', [now_utc]),
              ('DateTimeValue', [sDateTime]),
              ('DateTimeKey', [sDateTimeKey])]
    # Build each record as a one-row frame and concatenate; from_dict and
    # concat replace the removed from_items and append of older pandas.
    if t==1:
        TimeFrame = pd.DataFrame.from_dict(dict(TimeLine))
    else:
        TimeRow = pd.DataFrame.from_dict(dict(TimeLine))
        TimeFrame = pd.concat([TimeFrame, TimeRow])
################################################################
TimeHub=TimeFrame[['IDNumber','ZoneBaseKey','DateTimeKey','DateTimeValue']]
TimeHubIndex=TimeHub.set_index(['IDNumber'],inplace=False)
################################################################
TimeFrame.set_index(['IDNumber'],inplace=True)
################################################################
sTable = 'Process-Time'
print('Storing :',sDatabaseName,' Table:',sTable)
TimeHubIndex.to_sql(sTable, conn1, if_exists="replace")
################################################################
sTable = 'Hub-Time'
print('Storing :',sDatabaseName,' Table:',sTable)
TimeHubIndex.to_sql(sTable, conn2, if_exists="replace")
################################################################
active_timezones=all_timezones
z=0
for zone in active_timezones:
    t=0
    for j in range(TimeFrame.shape[0]):
        now_date=TimeFrame['nDateTimeValue'].iloc[j]
        DateTimeKey=TimeFrame['DateTimeKey'].iloc[j]
        now_utc=now_date.replace(tzinfo=timezone('UTC'))
        sDateTime=now_utc.strftime("%Y-%m-%d %H:%M:%S")
        now_zone = now_utc.astimezone(timezone(zone))
        sZoneDateTime=now_zone.strftime("%Y-%m-%d %H:%M:%S")
        print(sZoneDateTime)
        t+=1
        z+=1
        IDZoneNumber=str(uuid.uuid4())
        TimeZoneLine=[('ZoneBaseKey', ['UTC']),
                      ('IDZoneNumber', [IDZoneNumber]),
                      ('DateTimeKey', [DateTimeKey]),
                      ('UTCDateTimeValue', [sDateTime]),
                      ('Zone', [zone]),
                      ('DateTimeValue', [sZoneDateTime])]
        if t==1:
            TimeZoneFrame = pd.DataFrame.from_dict(dict(TimeZoneLine))
        else:
            TimeZoneRow = pd.DataFrame.from_dict(dict(TimeZoneLine))
            TimeZoneFrame = pd.concat([TimeZoneFrame, TimeZoneRow])
    TimeZoneFrameIndex=TimeZoneFrame.set_index(['IDZoneNumber'],inplace=False)
    sZone=zone.replace('/','-').replace(' ','')
    #############################################################
    sTable = 'Process-Time-'+sZone
    print('Storing :',sDatabaseName,' Table:',sTable)
    TimeZoneFrameIndex.to_sql(sTable, conn1, if_exists="replace")
    #############################################################
    sTable = 'Satellite-Time-'+sZone
    print('Storing :',sDatabaseName,' Table:',sTable)
    TimeZoneFrameIndex.to_sql(sTable, conn2, if_exists="replace")
#################################################################
print('################')
print('Vacuum Databases')
sSQL="VACUUM;"
# The sqlite3 connections can run the statement directly.
conn1.execute(sSQL)
conn2.execute(sSQL)
print('################')
#################################################################
print('### Done!! ############################################')
#################################################################
You have built your first hub and satellites for time in the data vault.
The data vault has been built in directory ..\VKHCG\88-DV\datavault.db. You can access it with your SQLite tools.
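If you want to confirm what was created, a minimal check such as the following (a sketch; it assumes the datavault.db path used above) lists the hub and satellite tables:

import sqlite3 as sq

# Quick check: list the tables the time-hub script created.
conn = sq.connect('C:/VKHCG/88-DV/datavault.db')
cur = conn.execute("SELECT name FROM sqlite_master WHERE type='table' ORDER BY name;")
for (name,) in cur.fetchall():
    print(name)   # expect Hub-Time plus one Satellite-Time-<zone> table per time zone
conn.close()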

B. Golden Nominal Person HUB

A golden nominal record is a single person's record, with distinctive references for use by all systems. This gives the system a single view of the person. I use first name, other names, last name, and birth date as my golden nominal. The data we have in the assess directory requires a birth date to become a golden nominal.
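To make the idea concrete, here is a minimal sketch of how a golden nominal key can be composed from the identifying fields (the helper and the sample values are hypothetical; the real program below generates its keys with uuid4):

# Hypothetical helper: compose a golden nominal key from the identifying fields.
def golden_nominal_key(first_name, other_names, last_name, birth_date):
    # Normalize case and spacing, so the same person always yields the same key.
    parts = [first_name, other_names, last_name, birth_date]
    return '|'.join(p.strip().upper() for p in parts if p)

print(golden_nominal_key('Guðmundur', '', 'Gunnarsson', '1960-12-20'))
# GUÐMUNDUR|GUNNARSSON|1960-12-20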
The program below will generate a golden nominal using our sample data set.
Open your Python editor and create a file called Process-People.py in the C:\VKHCG\04-Clark\03-Process directory.

import sys
import os
import sqlite3 as sq
import pandas as pd
from datetime import datetime, timedelta
from pytz import timezone, all_timezones
from random import randint
import uuid
################################################################
if sys.platform == 'linux':
    Base=os.path.expanduser('~') + '/VKHCG'
else:
    Base='C:/VKHCG'
print('################################')
print('Working Base :',Base, ' using ', sys.platform)
print('################################')
################################################################
Company='04-Clark'
sInputFileName='02-Assess/01-EDS/02-Python/Assess_People.csv'
################################################################
sDataBaseDir=Base + '/' + Company + '/03-Process/SQLite'
if not os.path.exists(sDataBaseDir):
    os.makedirs(sDataBaseDir)
################################################################
sDatabaseName=sDataBaseDir + '/clark.db'
conn1 = sq.connect(sDatabaseName)
################################################################
sDataVaultDir=Base + '/88-DV'
if not os.path.exists(sDataVaultDir):   # fixed: previously re-checked sDataBaseDir
    os.makedirs(sDataVaultDir)
################################################################
sDatabaseName=sDataVaultDir + '/datavault.db'
conn2 = sq.connect(sDatabaseName)
################################################################
### Import People Data
################################################################
sFileName=Base + '/' + Company + '/' + sInputFileName
print('################################')
print('Loading :',sFileName)
print('################################')
print(sFileName)
RawData=pd.read_csv(sFileName,header=0,low_memory=False, encoding="latin-1")
RawData.drop_duplicates(subset=None, keep='first', inplace=True)
# Assume nobody was born before January 1, 1900; generate random birth dates.
start_date = datetime(1900,1,1,0,0,0)
start_date_utc=start_date.replace(tzinfo=timezone('UTC'))
HoursBirth=100*365*24
RawData['BirthDateUTC']=RawData.apply(lambda row:
    (start_date_utc + timedelta(hours=randint(0, HoursBirth))),axis=1)
zonemax=len(all_timezones)-1
RawData['TimeZone']=RawData.apply(lambda row:
    (all_timezones[randint(0, zonemax)]),axis=1)
RawData['BirthDateISO']=RawData.apply(lambda row:
    row["BirthDateUTC"].astimezone(timezone(row['TimeZone'])),axis=1)
RawData['BirthDateKey']=RawData.apply(lambda row:
    row["BirthDateUTC"].strftime("%Y-%m-%d %H:%M:%S"),axis=1)
RawData['BirthDate']=RawData.apply(lambda row:
    row["BirthDateISO"].strftime("%Y-%m-%d %H:%M:%S"),axis=1)
RawData['PersonID']=RawData.apply(lambda row:
    str(uuid.uuid4()),axis=1)
################################################################
Data=RawData.copy()
# Drop the complex ISO date-time columns; they do not map onto SQLite data types.
Data.drop('BirthDateUTC', axis=1,inplace=True)
Data.drop('BirthDateISO', axis=1,inplace=True)
indexed_data = Data.set_index(['PersonID'])
print('################################')
#################################################################
print('################')
sTable='Process_Person'
print('Storing :',sDatabaseName,' Table:',sTable)
indexed_data.to_sql(sTable, conn1, if_exists="replace")
print('################')
################################################################
PersonHubRaw=Data[['PersonID','FirstName','SecondName','LastName','BirthDateKey']].copy()
PersonHubRaw['PersonHubID']=RawData.apply(lambda row:
    str(uuid.uuid4()),axis=1)
PersonHub=PersonHubRaw.drop_duplicates(subset=None, keep='first', inplace=False)
indexed_PersonHub = PersonHub.set_index(['PersonHubID'])
sTable = 'Hub-Person'
print('Storing :',sDatabaseName,' Table:',sTable)
indexed_PersonHub.to_sql(sTable, conn2, if_exists="replace")
###############################################################
PersonSatelliteGenderRaw=Data[['PersonID','FirstName','SecondName','LastName',
                               'BirthDateKey','Gender']].copy()
PersonSatelliteGenderRaw['PersonSatelliteID']=RawData.apply(lambda row:
    str(uuid.uuid4()),axis=1)
PersonSatelliteGender=PersonSatelliteGenderRaw.drop_duplicates(subset=None,
                                                               keep='first', inplace=False)
indexed_PersonSatelliteGender = PersonSatelliteGender.set_index(['PersonSatelliteID'])
sTable = 'Satellite-Person-Gender'
print('Storing :',sDatabaseName,' Table:',sTable)
indexed_PersonSatelliteGender.to_sql(sTable, conn2, if_exists="replace")
################################################################
PersonSatelliteBirthdayRaw=Data[['PersonID','FirstName','SecondName','LastName',
                                 'BirthDateKey','TimeZone','BirthDate']].copy()
PersonSatelliteBirthdayRaw['PersonSatelliteID']=RawData.apply(lambda row:
    str(uuid.uuid4()),axis=1)
PersonSatelliteBirthday=PersonSatelliteBirthdayRaw.drop_duplicates(subset=None,
                                                                   keep='first', inplace=False)
indexed_PersonSatelliteBirthday = PersonSatelliteBirthday.set_index(['PersonSatelliteID'])
sTable = 'Satellite-Person-Names'
print('Storing :',sDatabaseName,' Table:',sTable)
indexed_PersonSatelliteBirthday.to_sql(sTable, conn2, if_exists="replace")
################################################################
sFileDir=Base + '/' + Company + '/03-Process/01-EDS/02-Python'
if not os.path.exists(sFileDir):
    os.makedirs(sFileDir)
################################################################
sOutputFileName = sTable + '.csv'
sFileName=sFileDir + '/' + sOutputFileName
print('################################')
print('Storing :', sFileName)
print('################################')
RawData.to_csv(sFileName, index = False)
print('################################')
#################################################################
print('################')
print('Vacuum Databases')
sSQL="VACUUM;"
# pandas.io.sql.execute was removed in newer pandas; run VACUUM directly.
conn1.execute(sSQL)
conn2.execute(sSQL)
print('################')
#################################################################
print('### Done!! ############################################')
#################################################################
Output:
The program applies the golden nominal rules: it assumes nobody was born before January 1, 1900, drops the two complex ISO date-time columns (they do not translate into SQLite's data types), and saves your new golden nominal to a CSV file. It then loads the person records into the data vault.
========== RESTART: C:\VKHCG\04-Clark\03-Process\Process-People.py ==========
################################
Working Base : C:/VKHCG using win32
################################
################################
Loading : C:/VKHCG/04-Clark/02-Assess/01-EDS/02-Python/Assess_People.csv
################################
C:/VKHCG/04-Clark/02-Assess/01-EDS/02-Python/Assess_People.csv
################################
################
Storing : C:/VKHCG/88-DV/datavault.db Table: Process_Person
################
Storing : C:/VKHCG/88-DV/datavault.db Table: Satellite-Person-Gender
Storing : C:/VKHCG/88-DV/datavault.db Table: Satellite-Person-Names
################################
Storing : C:/VKHCG/04-Clark/03-Process/01-EDS/02-Python/Satellite-Person-Names.csv
################################
################################
################
Vacuum Databases
################
### Done!! ############################################

C. Human-Environment Interaction

The interaction of humans with their environment is a major relationship that guides both people's behavior and the characteristics of a location. Activities such as mining and other industries, roads, and landscaping at a location create both positive and negative effects on the environment, and also on humans. A location earmarked as a green belt, to assist in reducing the carbon footprint, or as a new interstate, changes its current and future characteristics. Location is a major data source for data science, and we normally find unknown or unexpected location effects in the data insights. In the Python editor, open a new file named Process_Location.py in directory ..\VKHCG\01-Vermeulen\03-Process.
################################################################
# -*- coding: utf-8 -*-
################################################################
import sys
import os
import pandas as pd
import sqlite3 as sq
import uuid
################################################################
Base='C:/VKHCG'
print('################################')
print('Working Base :',Base, ' using ', sys.platform)
print('################################')
################################################################
Company='01-Vermeulen'
InputAssessGraphName='Assess_All_Animals.gml'
EDSAssessDir='02-Assess/01-EDS'
InputAssessDir=EDSAssessDir + '/02-Python'
################################################################
sFileAssessDir=Base + '/' + Company + '/' + InputAssessDir
if not os.path.exists(sFileAssessDir):
    os.makedirs(sFileAssessDir)
################################################################
sDataBaseDir=Base + '/' + Company + '/03-Process/SQLite'
if not os.path.exists(sDataBaseDir):
    os.makedirs(sDataBaseDir)
################################################################
sDatabaseName=sDataBaseDir + '/Vermeulen.db'
conn1 = sq.connect(sDatabaseName)
################################################################
sDataVaultDir=Base + '/88-DV'
if not os.path.exists(sDataVaultDir):   # fixed: previously re-checked sDataBaseDir
    os.makedirs(sDataVaultDir)
################################################################
sDatabaseName=sDataVaultDir + '/datavault.db'
conn2 = sq.connect(sDatabaseName)
t=0
tMax=(360//10)*(180//10)   # fixed: the loops step by 10 degrees, so 36*18=648 grid points
for Longitude in range(-180,180,10):
    for Latitude in range(-90,90,10):
        t+=1
        IDNumber=str(uuid.uuid4())
        LocationName='L'+format(round(Longitude,3)*1000, '+07d') + \
                     '-'+format(round(Latitude,3)*1000, '+07d')
        print('Create:',t,' of ',tMax,':',LocationName)
        LocationLine=[('ObjectBaseKey', ['GPS']),
                      ('IDNumber', [IDNumber]),
                      ('LocationNumber', [str(t)]),
                      ('LocationName', [LocationName]),
                      ('Longitude', [Longitude]),
                      ('Latitude', [Latitude])]
        # DataFrame.from_items was removed in newer pandas; build from a dict instead.
        if t==1:
            LocationFrame = pd.DataFrame(dict(LocationLine))
        else:
            LocationRow = pd.DataFrame(dict(LocationLine))
            LocationFrame = pd.concat([LocationFrame, LocationRow])
################################################################
LocationHubIndex=LocationFrame.set_index(['IDNumber'],inplace=False)
################################################################
sTable = 'Process-Location'
print('Storing :',sDatabaseName,' Table:',sTable)
LocationHubIndex.to_sql(sTable, conn1, if_exists="replace")
#################################################################
sTable = 'Hub-Location'
print('Storing :',sDatabaseName,' Table:',sTable)
LocationHubIndex.to_sql(sTable, conn2, if_exists="replace")
#################################################################
print('################')
print('Vacuum Databases')
sSQL="VACUUM;"
# pandas.io.sql.execute was removed in newer pandas; run VACUUM directly.
conn1.execute(sSQL)
conn2.execute(sSQL)
print('################')
################################################################
print('### Done!! ############################################')
Output:
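As a quick sanity check (a sketch; it assumes the datavault.db path used above), you can count the grid points the script stored; the 10-degree grid yields 36 x 18 = 648 locations:

import sqlite3 as sq

conn = sq.connect('C:/VKHCG/88-DV/datavault.db')
(count,) = conn.execute('SELECT COUNT(*) FROM [Hub-Location];').fetchone()
print('Hub-Location rows:', count)   # expect 648 for the 10-degree grid
conn.close()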

D. Forecasting
Forecasting is the ability to project a possible future by looking at historical data. The data vault enables these types of investigations, owing to the complete history it collects as it processes the source systems' data. A data scientist supplies answers to such questions as the following:
• What should we buy?
• What should we sell?
• Where will our next business come from?
People want to know what you calculate, to determine what is about to happen; a minimal sketch of the idea follows.
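To make "looking at historical data" concrete, here is a minimal sketch of a naive forecast (the price values are hypothetical; any of the share CSV files produced below would supply a real Close column) that projects the next value as a rolling mean of recent history:

import pandas as pd

# Hypothetical series of daily closing prices.
close = pd.Series([101.0, 102.5, 101.8, 103.2, 104.0, 103.7, 105.1])

# Naive forecast: tomorrow's value is the mean of the last 3 observations.
window = 3
forecast = close.tail(window).mean()
print('Last observations:', list(close.tail(window)))
print('Naive forecast   :', round(forecast, 2))   # (104.0+103.7+105.1)/3 = 104.27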
Open a new file in your Python editor and save it as Process-Shares-Data.py in directory C:\VKHCG\04-Clark\03-Process. I will guide you through this process. You will require a library called quandl; install it by typing pip install quandl at the command prompt.
################################################################
import sys
import os
import sqlite3 as sq
import quandl
import pandas as pd
################################################################
# quandl.ApiConfig.api_key = '<your key>'   # optional: anonymous calls are rate-limited
################################################################
Base='C:/VKHCG'
print('################################')
print('Working Base :',Base, ' using ', sys.platform)
print('################################')
################################################################
Company='04-Clark'
sInputFileName='00-RawData/VKHCG_Shares.csv'
sOutputFileName='Shares.csv'
################################################################
sDataBaseDir=Base + '/' + Company + '/03-Process/SQLite'
if not os.path.exists(sDataBaseDir):
    os.makedirs(sDataBaseDir)
################################################################
sFileDir1=Base + '/' + Company + '/01-Retrieve/01-EDS/02-Python'
if not os.path.exists(sFileDir1):
    os.makedirs(sFileDir1)
################################################################
sFileDir2=Base + '/' + Company + '/02-Assess/01-EDS/02-Python'
if not os.path.exists(sFileDir2):
    os.makedirs(sFileDir2)
################################################################
sFileDir3=Base + '/' + Company + '/03-Process/01-EDS/02-Python'
if not os.path.exists(sFileDir3):
    os.makedirs(sFileDir3)
################################################################
sDatabaseName=sDataBaseDir + '/clark.db'
conn = sq.connect(sDatabaseName)
################################################################
### Import Share Names Data
################################################################
sFileName=Base + '/' + Company + '/' + sInputFileName
print('################################')
print('Loading :',sFileName)
print('################################')
RawData=pd.read_csv(sFileName,header=0,low_memory=False, encoding="latin-1")
RawData.drop_duplicates(subset=None, keep='first', inplace=True)
print('Rows :',RawData.shape[0])
print('Columns:',RawData.shape[1])
print('################')
################################################################
sFileName=sFileDir1 + '/Retrieve_' + sOutputFileName
print('################################')
print('Storing :', sFileName)
print('################################')
RawData.to_csv(sFileName, index = False)
print('################################')
################################################################
sFileName=sFileDir2 + '/Assess_' + sOutputFileName
print('################################')
print('Storing :', sFileName)
print('################################')
RawData.to_csv(sFileName, index = False)
print('################################')
################################################################
sFileName=sFileDir3 + '/Process_' + sOutputFileName
print('################################')
print('Storing :', sFileName)
print('################################')
RawData.to_csv(sFileName, index = False)
print('################################')
################################################################
### Import Shares Data Details
nShares=RawData.shape[0]
#nShares=6
for sShare in range(nShares):
    sShareName=str(RawData['Shares'][sShare])
    ShareData = quandl.get(sShareName)
    UnitsOwn=RawData['Units'][sShare]
    ShareData['UnitsOwn']=ShareData.apply(lambda row:(UnitsOwn),axis=1)
    ShareData['ShareCode']=ShareData.apply(lambda row:(sShareName),axis=1)
    print('################')
    print('Share :',sShareName)
    print('Rows :',ShareData.shape[0])
    print('Columns:',ShareData.shape[1])
    print('################')
    #################################################################
    print('################')
    sTable=str(RawData['sTable'][sShare])
    print('Storing :',sDatabaseName,' Table:',sTable)
    ShareData.to_sql(sTable, conn, if_exists="replace")
    print('################')
    ################################################################
    sOutputShareFileName = sTable.replace("/","-") + '.csv'
    sFileName=sFileDir1 + '/Retrieve_' + sOutputShareFileName
    print('################################')
    print('Storing :', sFileName)
    print('################################')
    ShareData.to_csv(sFileName, index = False)
    print('################################')
    ################################################################
    sFileName=sFileDir2 + '/Assess_' + sOutputShareFileName
    print('################################')
    print('Storing :', sFileName)
    print('################################')
    ShareData.to_csv(sFileName, index = False)
    print('################################')
    ################################################################
    sFileName=sFileDir3 + '/Process_' + sOutputShareFileName
    print('################################')
    print('Storing :', sFileName)
    print('################################')
    ShareData.to_csv(sFileName, index = False)
    print('################################')
print('### Done!! ############################################')
################################################################
Output:
======== RESTART: C:\VKHCG\04-Clark\03-Process\Process-Shares-Data.py ========
Working Base : C:/VKHCG using win32
Loading : C:/VKHCG/04-Clark/00-RawData/VKHCG_Shares.csv
Rows : 10
Columns: 3
Storing : C:/VKHCG/04-Clark/01-Retrieve/01-EDS/02-Python/Retrieve_Shares.csv
Storing : C:/VKHCG/04-Clark/02-Assess/01-EDS/02-Python/Assess_Shares.csv
Storing : C:/VKHCG/04-Clark/03-Process/01-EDS/02-Python/Process_Shares.csv
Share : WIKI/GOOGL
Rows : 3424
Columns: 14
Storing : C:/VKHCG/04-Clark/03-Process/SQLite/clark.db Table: WIKI_Google
Storing : C:/VKHCG/04-Clark/01-Retrieve/01-EDS/02-Python/Retrieve_WIKI_Google.csv
Storing : C:/VKHCG/04-Clark/02-Assess/01-EDS/02-Python/Assess_WIKI_Google.csv
Storing : C:/VKHCG/04-Clark/03-Process/01-EDS/02-Python/Process_WIKI_Google.csv
Share : WIKI/MSFT
Rows : 8076
Columns: 14
Storing : C:/VKHCG/04-Clark/03-Process/SQLite/clark.db Table: WIKI_Microsoft
Storing : C:/VKHCG/04-Clark/01-Retrieve/01-EDS/02-Python/Retrieve_WIKI_Microsoft.csv
Storing : C:/VKHCG/04-Clark/02-Assess/01-EDS/02-Python/Assess_WIKI_Microsoft.csv
Storing : C:/VKHCG/04-Clark/03-Process/01-EDS/02-Python/Process_WIKI_Microsoft.csv
Share : WIKI/UPS
Rows : 4622
Columns: 14
Storing : C:/VKHCG/04-Clark/03-Process/SQLite/clark.db Table: WIKI_UPS
Storing : C:/VKHCG/04-Clark/01-Retrieve/01-EDS/02-Python/Retrieve_WIKI_UPS.csv
Storing : C:/VKHCG/04-Clark/02-Assess/01-EDS/02-Python/Assess_WIKI_UPS.csv
Storing : C:/VKHCG/04-Clark/03-Process/01-EDS/02-Python/Process_WIKI_UPS.csv
Share : WIKI/AMZN
Rows : 5248
Columns: 14
Storing : C:/VKHCG/04-Clark/03-Process/SQLite/clark.db Table: WIKI_Amazon
Storing : C:/VKHCG/04-Clark/01-Retrieve/01-EDS/02-Python/Retrieve_WIKI_Amazon.csv
Storing : C:/VKHCG/04-Clark/02-Assess/01-EDS/02-Python/Assess_WIKI_Amazon.csv
Storing : C:/VKHCG/04-Clark/03-Process/01-EDS/02-Python/Process_WIKI_Amazon.csv
Share : LOCALBTC/USD
Rows : 1863
Columns: 6
Storing : C:/VKHCG/04-Clark/03-Process/SQLite/clark.db Table: LOCALBTC_USD
Storing : C:/VKHCG/04-Clark/01-Retrieve/01-EDS/02-Python/Retrieve_LOCALBTC_USD.csv
Storing : C:/VKHCG/04-Clark/02-Assess/01-EDS/02-Python/Assess_LOCALBTC_USD.csv
Storing : C:/VKHCG/04-Clark/03-Process/01-EDS/02-Python/Process_LOCALBTC_USD.csv
Share : PERTH/AUD_USD_M
Rows : 340
Columns: 8
Storing : C:/VKHCG/04-Clark/03-Process/SQLite/clark.db Table: PERTH_AUD_USD_M
Storing : C:/VKHCG/04-Clark/01-Retrieve/01-EDS/02-Python/Retrieve_PERTH_AUD_USD_M.csv
Storing : C:/VKHCG/04-Clark/02-Assess/01-EDS/02-Python/Assess_PERTH_AUD_USD_M.csv
Storing : C:/VKHCG/04-Clark/03-Process/01-EDS/02-Python/Process_PERTH_AUD_USD_M.csv
Share : PERTH/AUD_USD_D
Rows : 7989
Columns: 8
Storing : C:/VKHCG/04-Clark/03-Process/SQLite/clark.db Table: PERTH_AUD_USD_D
Storing : C:/VKHCG/04-Clark/01-Retrieve/01-EDS/02-Python/Retrieve_PERTH_AUD_USD_D.csv
Storing : C:/VKHCG/04-Clark/02-Assess/01-EDS/02-Python/Assess_PERTH_AUD_USD_D.csv
Storing : C:/VKHCG/04-Clark/03-Process/01-EDS/02-Python/Process_PERTH_AUD_USD_D.csv
Share : FRED/GDP
Rows : 290
Columns: 3
Storing : C:/VKHCG/04-Clark/03-Process/SQLite/clark.db Table: FRED/GDP
Storing : C:/VKHCG/04-Clark/01-Retrieve/01-EDS/02-Python/Retrieve_FRED-GDP.csv
Storing : C:/VKHCG/04-Clark/02-Assess/01-EDS/02-Python/Assess_FRED-GDP.csv
Storing : C:/VKHCG/04-Clark/03-Process/01-EDS/02-Python/Process_FRED-GDP.csv
Share : FED/RXI_US_N_A_UK
Rows : 49
Columns: 3
Storing : C:/VKHCG/04-Clark/03-Process/SQLite/clark.db Table: FED_RXI_US_N_A_UK
Storing : C:/VKHCG/04-Clark/01-Retrieve/01-EDS/02-Python/Retrieve_FED_RXI_US_N_A_UK.csv
Storing : C:/VKHCG/04-Clark/02-Assess/01-EDS/02-Python/Assess_FED_RXI_US_N_A_UK.csv
Storing : C:/VKHCG/04-Clark/03-Process/01-EDS/02-Python/Process_FED_RXI_US_N_A_UK.csv
Share : FED/RXI_N_A_CA
Rows : 49
Columns: 3
Storing : C:/VKHCG/04-Clark/03-Process/SQLite/clark.db Table: FED_RXI_N_A_CA
Storing : C:/VKHCG/04-Clark/01-Retrieve/01-EDS/02-Python/Retrieve_FED_RXI_N_A_CA.csv
Storing : C:/VKHCG/04-Clark/02-Assess/01-EDS/02-Python/Assess_FED_RXI_N_A_CA.csv
Storing : C:/VKHCG/04-Clark/03-Process/01-EDS/02-Python/Process_FED_RXI_N_A_CA.csv
### Done!! ############################################

Practical 7
Aim: Transforming Data

To illustrate the consolidation process, the example shows a person being born. Open a new file in the Python editor and save it as Transform-Gunnarsson_is_Born.py in directory C:\VKHCG\01-Vermeulen\04-Transform.
################################################################
# -*- coding: utf-8 -*-
################################################################
import sys
import os
from datetime import datetime
from pytz import timezone
import pandas as pd
import sqlite3 as sq
import uuid
pd.options.mode.chained_assignment = None
################################################################
Base='C:/VKHCG'
print('################################')
print('Working Base :',Base, ' using ', sys.platform)
print('################################')
################################################################
Company='01-Vermeulen'
InputDir='00-RawData'
InputFileName='VehicleData.csv'   # (not used in this example)
################################################################
sDataBaseDir=Base + '/' + Company + '/04-Transform/SQLite'
if not os.path.exists(sDataBaseDir):
    os.makedirs(sDataBaseDir)
################################################################
sDatabaseName=sDataBaseDir + '/Vermeulen.db'
conn1 = sq.connect(sDatabaseName)
################################################################
sDataVaultDir=Base + '/88-DV'
if not os.path.exists(sDataVaultDir):
    os.makedirs(sDataVaultDir)
################################################################
sDatabaseName=sDataVaultDir + '/datavault.db'
conn2 = sq.connect(sDatabaseName)
sDataWarehouseDir=Base + '/99-DW'
if not os.path.exists(sDataWarehouseDir):
    os.makedirs(sDataWarehouseDir)
################################################################
sDatabaseName=sDataWarehouseDir + '/datawarehouse.db'
conn3 = sq.connect(sDatabaseName)
################################################################
print('\n#################################')
print('Time Category')
print('UTC Time')
BirthDateUTC = datetime(1960,12,20,10,15,0)
BirthDateZoneUTC=BirthDateUTC.replace(tzinfo=timezone('UTC'))
BirthDateZoneStr=BirthDateZoneUTC.strftime("%Y-%m-%d %H:%M:%S")
BirthDateZoneUTCStr=BirthDateZoneUTC.strftime("%Y-%m-%d %H:%M:%S (%Z) (%z)")
print(BirthDateZoneUTCStr)
print('#################################')
print('Birth Date in Reykjavik :')
BirthZone = 'Atlantic/Reykjavik'
BirthDate = BirthDateZoneUTC.astimezone(timezone(BirthZone))
BirthDateStr=BirthDate.strftime("%Y-%m-%d %H:%M:%S (%Z) (%z)")
BirthDateLocal=BirthDate.strftime("%Y-%m-%d %H:%M:%S")
print(BirthDateStr)
print('#################################')
################################################################
IDZoneNumber=str(uuid.uuid4())
sDateTimeKey=BirthDateZoneStr.replace(' ','-').replace(':','-')
TimeLine=[('ZoneBaseKey', ['UTC']),
          ('IDNumber', [IDZoneNumber]),
          ('DateTimeKey', [sDateTimeKey]),
          ('UTCDateTimeValue', [BirthDateZoneUTC]),
          ('Zone', [BirthZone]),
          ('DateTimeValue', [BirthDateStr])]
# DataFrame.from_items was removed in newer pandas; build from a dict instead.
TimeFrame = pd.DataFrame(dict(TimeLine))
################################################################
TimeHub=TimeFrame[['IDNumber','ZoneBaseKey','DateTimeKey','DateTimeValue']]
TimeHubIndex=TimeHub.set_index(['IDNumber'],inplace=False)
################################################################
sTable = 'Hub-Time-Gunnarsson'
print('\n#################################')
print('Storing :',sDatabaseName,'\n Table:',sTable)
print('\n#################################')
TimeHubIndex.to_sql(sTable, conn2, if_exists="replace")
sTable = 'Dim-Time-Gunnarsson'
TimeHubIndex.to_sql(sTable, conn3, if_exists="replace")
################################################################
TimeSatellite=TimeFrame[['IDNumber','DateTimeKey','Zone','DateTimeValue']]
TimeSatelliteIndex=TimeSatellite.set_index(['IDNumber'],inplace=False)
################################################################
BirthZoneFix=BirthZone.replace(' ','-').replace('/','-')
sTable = 'Satellite-Time-' + BirthZoneFix + '-Gunnarsson'
print('\n#################################')
print('Storing :',sDatabaseName,'\n Table:',sTable)
print('\n#################################')
TimeSatelliteIndex.to_sql(sTable, conn2, if_exists="replace")
sTable = 'Dim-Time-' + BirthZoneFix + '-Gunnarsson'
TimeSatelliteIndex.to_sql(sTable, conn3, if_exists="replace")
################################################################
print('\n#################################')
print('Person Category')
FirstName = 'Guðmundur'
LastName = 'Gunnarsson'
print('Name:',FirstName,LastName)
print('Birth Date:',BirthDateLocal)
print('Birth Zone:',BirthZone)
print('UTC Birth Date:',BirthDateZoneStr)
print('#################################')
###############################################################
IDPersonNumber=str(uuid.uuid4())
PersonLine=[('IDNumber', [IDPersonNumber]),
            ('FirstName', [FirstName]),
            ('LastName', [LastName]),
            ('Zone', ['UTC']),
            ('DateTimeValue', [BirthDateZoneStr])]
PersonFrame = pd.DataFrame(dict(PersonLine))
################################################################
PersonHub=PersonFrame   # fixed: this frame holds the person hub, not the time hub
PersonHubIndex=PersonHub.set_index(['IDNumber'],inplace=False)
################################################################
sTable = 'Hub-Person-Gunnarsson'
print('\n#################################')
print('Storing :',sDatabaseName,'\n Table:',sTable)
print('\n#################################')
PersonHubIndex.to_sql(sTable, conn2, if_exists="replace")
sTable = 'Dim-Person-Gunnarsson'
PersonHubIndex.to_sql(sTable, conn3, if_exists="replace")

Output: Guðmundur Gunnarsson was born on December 20, 1960, at 9:15, in Landspítali, Hringbraut 101, 101 Reykjavík, Iceland. (The 10:15 UTC birth date corresponds to 9:15 local time, because Reykjavik was at UTC-1 in 1960.)

You must build three items: dimension Person, dimension Time, and fact PersonBornAtTime.
Open your Python editor and create a file named Transform-Gunnarsson-Sun-Model.py in directory C:\VKHCG\01-Vermeulen\04-Transform.
################################################################
# -*- coding: utf-8 -*-
################################################################
import sys
import os
from datetime import datetime
from pytz import timezone
import pandas as pd
import sqlite3 as sq
import uuid
pd.options.mode.chained_assignment = None
################################################################
if sys.platform == 'linux':
    Base=os.path.expanduser('~') + '/VKHCG'
else:
    Base='C:/VKHCG'
print('################################')
print('Working Base :',Base, ' using ', sys.platform)
print('################################')
################################################################
Company='01-Vermeulen'
################################################################
sDataBaseDir=Base + '/' + Company + '/04-Transform/SQLite'
if not os.path.exists(sDataBaseDir):
    os.makedirs(sDataBaseDir)
################################################################
sDatabaseName=sDataBaseDir + '/Vermeulen.db'
conn1 = sq.connect(sDatabaseName)
################################################################
sDataWarehouseDir=Base + '/99-DW'
if not os.path.exists(sDataWarehouseDir):
    os.makedirs(sDataWarehouseDir)
################################################################
sDatabaseName=sDataWarehouseDir + '/datawarehouse.db'
conn2 = sq.connect(sDatabaseName)
################################################################
print('\n#################################')
print('Time Dimension')
BirthZone = 'Atlantic/Reykjavik'
BirthDateUTC = datetime(1960,12,20,10,15,0)
BirthDateZoneUTC=BirthDateUTC.replace(tzinfo=timezone('UTC'))
BirthDateZoneStr=BirthDateZoneUTC.strftime("%Y-%m-%d %H:%M:%S")
BirthDateZoneUTCStr=BirthDateZoneUTC.strftime("%Y-%m-%d %H:%M:%S (%Z) (%z)")
BirthDate = BirthDateZoneUTC.astimezone(timezone(BirthZone))
BirthDateStr=BirthDate.strftime("%Y-%m-%d %H:%M:%S (%Z) (%z)")
BirthDateLocal=BirthDate.strftime("%Y-%m-%d %H:%M:%S")
################################################################
IDTimeNumber=str(uuid.uuid4())
TimeLine=[('TimeID', [IDTimeNumber]),
          ('UTCDate', [BirthDateZoneStr]),
          ('LocalTime', [BirthDateLocal]),
          ('TimeZone', [BirthZone])]
# DataFrame.from_items was removed in newer pandas; build from a dict instead.
TimeFrame = pd.DataFrame(dict(TimeLine))
################################################################
DimTime=TimeFrame
DimTimeIndex=DimTime.set_index(['TimeID'],inplace=False)
################################################################
sTable = 'Dim-Time'
print('\n#################################')
print('Storing :',sDatabaseName,'\n Table:',sTable)
print('\n#################################')
DimTimeIndex.to_sql(sTable, conn1, if_exists="replace")
DimTimeIndex.to_sql(sTable, conn2, if_exists="replace")
################################################################
print('\n#################################')
print('Dimension Person')
print('\n#################################')
FirstName = 'Guðmundur'
LastName = 'Gunnarsson'
###############################################################
IDPersonNumber=str(uuid.uuid4())
PersonLine=[('PersonID', [IDPersonNumber]),
            ('FirstName', [FirstName]),
            ('LastName', [LastName]),
            ('Zone', ['UTC']),
            ('DateTimeValue', [BirthDateZoneStr])]
PersonFrame = pd.DataFrame(dict(PersonLine))
################################################################
DimPerson=PersonFrame
DimPersonIndex=DimPerson.set_index(['PersonID'],inplace=False)
################################################################
sTable = 'Dim-Person'
print('\n#################################')
print('Storing :',sDatabaseName,'\n Table:',sTable)
print('\n#################################')
DimPersonIndex.to_sql(sTable, conn1, if_exists="replace")
DimPersonIndex.to_sql(sTable, conn2, if_exists="replace")
################################################################
print('\n#################################')
print('Fact - Person - Time')
print('\n#################################')
IDFactNumber=str(uuid.uuid4())
PersonTimeLine=[('IDNumber', [IDFactNumber]),
                ('IDPersonNumber', [IDPersonNumber]),
                ('IDTimeNumber', [IDTimeNumber])]
PersonTimeFrame = pd.DataFrame(dict(PersonTimeLine))
################################################################
FctPersonTime=PersonTimeFrame
FctPersonTimeIndex=FctPersonTime.set_index(['IDNumber'],inplace=False)
################################################################
sTable = 'Fact-Person-Time'
print('\n#################################')
print('Storing :',sDatabaseName,'\n Table:',sTable)
print('\n#################################')
FctPersonTimeIndex.to_sql(sTable, conn1, if_exists="replace")
FctPersonTimeIndex.to_sql(sTable, conn2, if_exists="replace")

Output:
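Before moving on, you can confirm the sun model you just stored by joining the fact to its two dimensions (a sketch; it assumes the Vermeulen.db path and the table names used above):

import sqlite3 as sq
import pandas as pd

conn = sq.connect('C:/VKHCG/01-Vermeulen/04-Transform/SQLite/Vermeulen.db')
sSQL = ("SELECT p.FirstName, p.LastName, t.LocalTime, t.TimeZone "
        "FROM [Fact-Person-Time] AS f "
        "JOIN [Dim-Person] AS p ON p.PersonID = f.IDPersonNumber "
        "JOIN [Dim-Time] AS t ON t.TimeID = f.IDTimeNumber;")
print(pd.read_sql_query(sSQL, conn))
conn.close()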

Building a Data Warehouse
Open the Transform-Sun-Models.py file from directory C:\VKHCG\01-Vermeulen\04-Transform.
################################################################
# -*- coding: utf-8 -*-
################################################################
import sys
import os
from datetime import datetime
from pytz import timezone
import pandas as pd
import sqlite3 as sq
import uuid
pd.options.mode.chained_assignment = None
################################################################
if sys.platform == 'linux':
    Base=os.path.expanduser('~') + '/VKHCG'
else:
    Base='C:/VKHCG'
print('################################')
print('Working Base :',Base, ' using ', sys.platform)
print('################################')
################################################################
Company='01-Vermeulen'
################################################################
sDataBaseDir=Base + '/' + Company + '/04-Transform/SQLite'
if not os.path.exists(sDataBaseDir):
    os.makedirs(sDataBaseDir)
################################################################
sDatabaseName=sDataBaseDir + '/Vermeulen.db'
conn1 = sq.connect(sDatabaseName)
################################################################
sDataVaultDir=Base + '/88-DV'
if not os.path.exists(sDataVaultDir):
    os.makedirs(sDataVaultDir)
################################################################
sDatabaseName=sDataVaultDir + '/datavault.db'
conn2 = sq.connect(sDatabaseName)
################################################################
sDataWarehouseDir=Base + '/99-DW'
if not os.path.exists(sDataWarehouseDir):
    os.makedirs(sDataWarehouseDir)
################################################################
sDatabaseName=sDataWarehouseDir + '/datawarehouse.db'
conn3 = sq.connect(sDatabaseName)
################################################################
sSQL=" SELECT DateTimeValue FROM [Hub-Time];"
DateDataRaw=pd.read_sql_query(sSQL, conn2)
DateData=DateDataRaw.head(1000)
print(DateData)
################################################################
print('\n#################################')
print('Time Dimension')
print('\n#################################')
t=0
mt=DateData.shape[0]
BirthZone = ('Atlantic/Reykjavik','Europe/London','UCT')
for i in range(mt):
    for j in range(len(BirthZone)):
        t+=1
        print(t,mt*len(BirthZone))
        BirthDateUTC = datetime.strptime(DateData['DateTimeValue'][i],"%Y-%m-%d %H:%M:%S")
        BirthDateZoneUTC=BirthDateUTC.replace(tzinfo=timezone('UTC'))
        BirthDateZoneStr=BirthDateZoneUTC.strftime("%Y-%m-%d %H:%M:%S")
        BirthDateZoneUTCStr=BirthDateZoneUTC.strftime("%Y-%m-%d %H:%M:%S (%Z) (%z)")
        BirthDate = BirthDateZoneUTC.astimezone(timezone(BirthZone[j]))
        BirthDateStr=BirthDate.strftime("%Y-%m-%d %H:%M:%S (%Z) (%z)")
        BirthDateLocal=BirthDate.strftime("%Y-%m-%d %H:%M:%S")
        ################################################################
        IDTimeNumber=str(uuid.uuid4())
        TimeLine=[('TimeID', [str(IDTimeNumber)]),
                  ('UTCDate', [str(BirthDateZoneStr)]),
                  ('LocalTime', [str(BirthDateLocal)]),
                  ('TimeZone', [str(BirthZone[j])])]   # fixed: store the zone, not the whole tuple
        # DataFrame.from_items was removed in newer pandas; build from a dict instead.
        if t==1:
            TimeFrame = pd.DataFrame(dict(TimeLine))
        else:
            TimeRow = pd.DataFrame(dict(TimeLine))
            TimeFrame = pd.concat([TimeFrame, TimeRow])
################################################################
DimTime=TimeFrame
DimTimeIndex=DimTime.set_index(['TimeID'],inplace=False)
################################################################
sTable = 'Dim-Time'
print('\n#################################')
print('Storing :',sDatabaseName,'\n Table:',sTable)
print('\n#################################')
DimTimeIndex.to_sql(sTable, conn1, if_exists="replace")
DimTimeIndex.to_sql(sTable, conn3, if_exists="replace")
################################################################
sSQL=" SELECT FirstName, SecondName, LastName, BirthDateKey FROM [Hub-Person];"
PersonDataRaw=pd.read_sql_query(sSQL, conn2)
PersonData=PersonDataRaw.head(1000)
################################################################
print('\n#################################')
print('Dimension Person')
print('\n#################################')
t=0
mt=PersonData.shape[0]   # fixed: loop over the person rows, not the date rows
for i in range(mt):
    t+=1
    print(t,mt)
    FirstName = str(PersonData["FirstName"][i])
    SecondName = str(PersonData["SecondName"][i])
    if SecondName == 'nan':   # fixed: blank out only missing second names
        SecondName=""
    LastName = str(PersonData["LastName"][i])
    BirthDateKey = str(PersonData["BirthDateKey"][i])
    IDPersonNumber=str(uuid.uuid4())
    PersonLine=[('PersonID', [str(IDPersonNumber)]),
                ('FirstName', [FirstName]),
                ('SecondName', [SecondName]),
                ('LastName', [LastName]),
                ('Zone', [str('UTC')]),
                ('BirthDate', [BirthDateKey])]
    if t==1:
        PersonFrame = pd.DataFrame(dict(PersonLine))
    else:
        PersonRow = pd.DataFrame(dict(PersonLine))
        PersonFrame = pd.concat([PersonFrame, PersonRow])
################################################################
DimPerson=PersonFrame
print(DimPerson)
DimPersonIndex=DimPerson.set_index(['PersonID'],inplace=False)
################################################################
sTable = 'Dim-Person'
print('\n#################################')
print('Storing :',sDatabaseName,'\n Table:',sTable)
print('\n#################################')
DimPersonIndex.to_sql(sTable, conn1, if_exists="replace")
DimPersonIndex.to_sql(sTable, conn3, if_exists="replace")
###############################################################
Output:
You have successfully performed a data vault to data warehouse transformation.
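If you want to confirm the load (a sketch; it assumes the datawarehouse.db path used above), read the new dimension back:

import sqlite3 as sq
import pandas as pd

conn = sq.connect('C:/VKHCG/99-DW/datawarehouse.db')
DimPerson = pd.read_sql_query("SELECT * FROM [Dim-Person];", conn)
print(DimPerson.shape)
print(DimPerson.head())
conn.close()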

Practical 8

Aim: Organizing Data

A. Horizontal Style

Performing horizontal-style slicing or subsetting of the data warehouse selects a subset of rows from the population, while preserving the columns: the data science tool sees every column, but only for the records that pass the row filter. The sketch below illustrates the idea in pandas.
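import pandas as pd

# Hypothetical Dim-BMI rows (the column names match the Dim-BMI table used below).
df = pd.DataFrame({'Height': [1.4, 1.6, 1.8],
                   'Weight': [60, 70, 80],
                   'Indicator': [1, 1, 2]})

# Horizontal slice: keep all columns, but only the rows that pass the filter.
print(df[(df['Height'] > 1.5) & (df['Indicator'] == 1)])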
C:\VKHCG\01-Vermeulen\05-Organise\Organize-Horizontal.py
################################################################
# -*- coding: utf-8 -*-
################################################################
import sys
import os
import pandas as pd
import sqlite3 as sq
################################################################
Base='C:/VKHCG'
print('################################')
print('Working Base :',Base, ' using ', sys.platform)
print('################################')
################################################################
Company='01-Vermeulen'
################################################################
sDataWarehouseDir=Base + '/99-DW'
if not os.path.exists(sDataWarehouseDir):
    os.makedirs(sDataWarehouseDir)
################################################################
sDatabaseName=sDataWarehouseDir + '/datawarehouse.db'
conn1 = sq.connect(sDatabaseName)
################################################################
sDatabaseName=sDataWarehouseDir + '/datamart.db'
conn2 = sq.connect(sDatabaseName)
################################################################
print('################')
sTable = 'Dim-BMI'
print('Loading :',sDatabaseName,' Table:',sTable)
sSQL="SELECT * FROM [Dim-BMI];"
PersonFrame0=pd.read_sql_query(sSQL, conn1)
print('################')
sTable = 'Dim-BMI'
print('Loading :',sDatabaseName,' Table:',sTable)
sSQL="SELECT PersonID, Height, Weight, bmi, Indicator " \
     "FROM [Dim-BMI] " \
     "WHERE Height > 1.5 AND Indicator = 1 " \
     "ORDER BY Height, Weight;"
PersonFrame1=pd.read_sql_query(sSQL, conn1)
################################################################
DimPerson=PersonFrame1
DimPersonIndex=DimPerson.set_index(['PersonID'],inplace=False)
################################################################
sTable = 'Dim-BMI'
print('\n#################################')
print('Storing :',sDatabaseName,'\n Table:',sTable)
print('\n#################################')
DimPersonIndex.to_sql(sTable, conn2, if_exists="replace")   # fixed: this store was commented out, but the read below depends on it
################################################################
print('################')
sTable = 'Dim-BMI'
print('Loading :',sDatabaseName,' Table:',sTable)
sSQL="SELECT * FROM [Dim-BMI];"
PersonFrame2=pd.read_sql_query(sSQL, conn2)
print('Full Data Set (Rows):', PersonFrame0.shape[0])
print('Full Data Set (Columns):', PersonFrame0.shape[1])
print('Horizontal Data Set (Rows):', PersonFrame2.shape[0])
print('Horizontal Data Set (Columns):', PersonFrame2.shape[1])

Output:

B. Vertical Style

Performing vertical-style slicing or subsetting of the data warehouse is achieved by applying a filter technique that forces the data warehouse to show only the data for specific preselected filtered outcomes against the data population. The vertical-style slicing selects a subset of columns from the population, while preserving the rows; that is, the data science tool can see only the preselected columns from a record, for all the records in the population. The sketch below illustrates the idea in pandas.
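import pandas as pd

# Hypothetical Dim-BMI rows (the column names match the Dim-BMI table used below).
df = pd.DataFrame({'PersonID': [1, 2],
                   'Height': [1.6, 1.8],
                   'Weight': [70, 80],
                   'bmi': [27.3, 24.7],
                   'Indicator': [1, 2]})

# Vertical slice: keep all rows, but only the preselected columns.
print(df[['Height', 'Weight', 'Indicator']])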
C:\VKHCG\01-Vermeulen\05-Organise\Organize-Vertical.py
################################################################
# -*- coding: utf-8 -*-
################################################################
import sys
import os
import pandas as pd
import sqlite3 as sq
################################################################
Base='C:/VKHCG'
print('################################')
print('Working Base :',Base, ' using ', sys.platform)
print('################################')
################################################################
Company='01-Vermeulen'
################################################################
sDataWarehouseDir=Base + '/99-DW'
if not os.path.exists(sDataWarehouseDir):
    os.makedirs(sDataWarehouseDir)
################################################################
sDatabaseName=sDataWarehouseDir + '/datawarehouse.db'
conn1 = sq.connect(sDatabaseName)
################################################################
sDatabaseName=sDataWarehouseDir + '/datamart.db'
conn2 = sq.connect(sDatabaseName)
################################################################
print('################################')
sTable = 'Dim-BMI'
print('Loading :',sDatabaseName,' Table:',sTable)
sSQL="SELECT * FROM [Dim-BMI];"
PersonFrame0=pd.read_sql_query(sSQL, conn1)
################################################################
print('################################')
sTable = 'Dim-BMI'
print('Loading :',sDatabaseName,' Table:',sTable)
print('################################')
sSQL="SELECT Height, Weight, Indicator FROM [Dim-BMI];"
PersonFrame1=pd.read_sql_query(sSQL, conn1)
################################################################
DimPerson=PersonFrame1
DimPersonIndex=DimPerson.set_index(['Indicator'],inplace=False)
################################################################
sTable = 'Dim-BMI-Vertical'
print('\n#################################')
print('Storing :',sDatabaseName,'\n Table:',sTable)
print('\n#################################')
DimPersonIndex.to_sql(sTable, conn2, if_exists="replace")
################################################################
print('################')
sTable = 'Dim-BMI-Vertical'
print('Loading :',sDatabaseName,' Table:',sTable)
sSQL="SELECT * FROM [Dim-BMI-Vertical];"
PersonFrame2=pd.read_sql_query(sSQL, conn2)
################################################################
print('################################')
print('Full Data Set (Rows):', PersonFrame0.shape[0])
print('Full Data Set (Columns):', PersonFrame0.shape[1])
print('################################')
print('Vertical Data Set (Rows):', PersonFrame2.shape[0])   # fixed label: this is the vertical subset
print('Vertical Data Set (Columns):', PersonFrame2.shape[1])
print('################################')
################################################################
Output:

C. Island Style

Performing island-style slicing or subsetting of the data warehouse is achieved by applying a combination of horizontal- and vertical-style slicing: it generates a subset in which specific rows and specific columns are restricted at the same time. The sketch below illustrates the idea in pandas.
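import pandas as pd

# Hypothetical Dim-BMI rows (the column names and the filter match the query used below).
df = pd.DataFrame({'Height': [1.6, 1.8, 1.7],
                   'Weight': [70, 80, 90],
                   'Indicator': [1, 3, 4]})

# Island slice: restrict the rows AND the columns at the same time.
print(df.loc[df['Indicator'] > 2, ['Height', 'Weight', 'Indicator']])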
C:\VKHCG\01-Vermeulen\05-Organise\Organize-Island.py

################################################################
# -*- coding: utf-8 -*-
################################################################
import sys
import os
import pandas as pd
import sqlite3 as sq
################################################################
Base='C:/VKHCG'
print('################################')
print('Working Base :',Base, ' using ', sys.platform)
print('################################')
################################################################
Company='01-Vermeulen'
################################################################
sDataWarehouseDir=Base + '/99-DW'
if not os.path.exists(sDataWarehouseDir):
    os.makedirs(sDataWarehouseDir)
################################################################
sDatabaseName=sDataWarehouseDir + '/datawarehouse.db'
conn1 = sq.connect(sDatabaseName)
################################################################
sDatabaseName=sDataWarehouseDir + '/datamart.db'
conn2 = sq.connect(sDatabaseName)
################################################################
print('################')
sTable = 'Dim-BMI'
print('Loading :',sDatabaseName,' Table:',sTable)
sSQL="SELECT * FROM [Dim-BMI];"
PersonFrame0=pd.read_sql_query(sSQL, conn1)
################################################################
print('################')
sTable = 'Dim-BMI'
print('Loading :',sDatabaseName,' Table:',sTable)
sSQL="SELECT Height, Weight, Indicator " \
     "FROM [Dim-BMI] " \
     "WHERE Indicator > 2 " \
     "ORDER BY Height, Weight;"
PersonFrame1=pd.read_sql_query(sSQL, conn1)
################################################################
DimPerson=PersonFrame1
DimPersonIndex=DimPerson.set_index(['Indicator'],inplace=False)
################################################################
sTable = 'Dim-BMI-Vertical'   # note: the example reuses this table name for the island subset
print('\n#################################')
print('Storing :',sDatabaseName,'\n Table:',sTable)
print('\n#################################')
DimPersonIndex.to_sql(sTable, conn2, if_exists="replace")
################################################################
print('################################')
sTable = 'Dim-BMI-Vertical'
print('Loading :',sDatabaseName,' Table:',sTable)
print('################################')
sSQL="SELECT * FROM [Dim-BMI-Vertical];"
PersonFrame2=pd.read_sql_query(sSQL, conn2)
################################################################
print('################################')
print('Full Data Set (Rows):', PersonFrame0.shape[0])
print('Full Data Set (Columns):', PersonFrame0.shape[1])
print('################################')
print('Island Data Set (Rows):', PersonFrame2.shape[0])   # fixed label: this is the island subset
print('Island Data Set (Columns):', PersonFrame2.shape[1])
print('################################')
################################################################
Output:

D. Secure Vault Style

The secure vault is a version of one of the horizontal, vertical, or island slicing techniques, but the outcome is also attached to the person who performs the query. This is common in multi-security environments, where different users are allowed to see different data sets. The process works well if you use a role-based access control (RBAC) approach to restricting system access to authorized users. The security is applied against the "role," and the security system can then simply add a person to, or remove a person from, a role to enable or disable access. A minimal sketch of the idea follows.
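Here is a minimal sketch of the RBAC idea (the role assignments and the user names are hypothetical): the query a user is allowed to run is constrained by the role the security system has placed that user in:

# Hypothetical role assignments: the security system adds or removes users from roles.
user_roles = {'albert': 'Sam', 'nina': 'Grant'}

def secure_vault_query(user):
    role = user_roles.get(user)
    if role is None:
        raise PermissionError('User has no role; access denied.')
    # Each role sees only its own slice of the secure table.
    return "SELECT * FROM [Dim-BMI-Secure] WHERE Name = '" + role + "';"

print(secure_vault_query('albert'))   # Albert is in the Sam role, so he sees only Sam's rows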

C:\VKHCG\01-Vermeulen\05-Organise\Organize-Secure-Vault.py
################################################################
# -*- coding: utf-8 -*-
################################################################
import sys
import os
import pandas as pd
import sqlite3 as sq
################################################################
Base='C:/VKHCG'
print('################################')
print('Working Base :',Base, ' using ', sys.platform)
print('################################')
################################################################
Company='01-Vermeulen'
################################################################
sDataWarehouseDir=Base + '/99-DW'
if not os.path.exists(sDataWarehouseDir):
    os.makedirs(sDataWarehouseDir)
################################################################
sDatabaseName=sDataWarehouseDir + '/datawarehouse.db'
conn1 = sq.connect(sDatabaseName)
################################################################
sDatabaseName=sDataWarehouseDir + '/datamart.db'
conn2 = sq.connect(sDatabaseName)
################################################################
print('################')
sTable = 'Dim-BMI'
print('Loading :',sDatabaseName,' Table:',sTable)
sSQL="SELECT * FROM [Dim-BMI];"
PersonFrame0=pd.read_sql_query(sSQL, conn1)
################################################################
print('################')
sTable = 'Dim-BMI'
print('Loading :',sDatabaseName,' Table:',sTable)
sSQL="SELECT Height, Weight, Indicator, " \
     "CASE Indicator " \
     "WHEN 1 THEN 'Pip' " \
     "WHEN 2 THEN 'Norman' " \
     "WHEN 3 THEN 'Grant' " \
     "ELSE 'Sam' " \
     "END AS Name " \
     "FROM [Dim-BMI] " \
     "WHERE Indicator > 2 " \
     "ORDER BY Height, Weight;"
PersonFrame1=pd.read_sql_query(sSQL, conn1)
################################################################
DimPerson=PersonFrame1
DimPersonIndex=DimPerson.set_index(['Indicator'],inplace=False)
################################################################
sTable = 'Dim-BMI-Secure'
print('\n#################################')
print('Storing :',sDatabaseName,'\n Table:',sTable)
print('\n#################################')
DimPersonIndex.to_sql(sTable, conn2, if_exists="replace")
################################################################
print('################################')
sTable = 'Dim-BMI-Secure'
print('Loading :',sDatabaseName,' Table:',sTable)
print('################################')
sSQL="SELECT * FROM [Dim-BMI-Secure] WHERE Name = 'Sam';"
PersonFrame2=pd.read_sql_query(sSQL, conn2)
################################################################
print('################################')
print('Full Data Set (Rows):', PersonFrame0.shape[0])
print('Full Data Set (Columns):', PersonFrame0.shape[1])
print('################################')
print('Secure Vault Data Set (Rows):', PersonFrame2.shape[0])   # fixed label: this is the secure subset
print('Secure Vault Data Set (Columns):', PersonFrame2.shape[1])
print('Only Sam Data')
print(PersonFrame2.head())
print('################################')
################################################################
Output:

E. Create a Network Routing Diagram

C:\VKHCG\01-Vermeulen\05-Organise\Organise-Network-Routing-Company.py
################################################################
import sys
import os
import pandas as pd
import networkx as nx
import matplotlib.pyplot as plt
################################################################
pd.options.mode.chained_assignment = None
################################################################
Base='C:/VKHCG'
################################################################
print('################################')
print('Working Base :',Base, ' using ', sys.platform)
print('################################')
################################################################
sInputFileName='02-Assess/01-EDS/02-Python/Assess-Network-Routing-Company.csv'
################################################################
sOutputFileName1='05-Organise/01-EDS/02-Python/Organise-Network-Routing-Company.gml'
sOutputFileName2='05-Organise/01-EDS/02-Python/Organise-Network-Routing-Company.png'
Company='01-Vermeulen'
################################################################
################################################################
### Import Company Data
################################################################
sFileName=Base + '/' + Company + '/' + sInputFileName
print('################################')
print('Loading :',sFileName)
print('################################')
CompanyData=pd.read_csv(sFileName,header=0,low_memory=False, encoding="latin-1")
print('################################')
################################################################
print(CompanyData.head())
print(CompanyData.shape)
################################################################
G=nx.Graph()
# Connect every pair of distinct company countries.
for i in range(CompanyData.shape[0]):
    for j in range(CompanyData.shape[0]):
        Node0=CompanyData['Company_Country_Name'][i]
        Node1=CompanyData['Company_Country_Name'][j]
        if Node0 != Node1:
            G.add_edge(Node0,Node1)
# Connect each company place to its country.
for i in range(CompanyData.shape[0]):
    Node0=CompanyData['Company_Country_Name'][i]
    Node1=CompanyData['Company_Place_Name'][i] + '('+ CompanyData['Company_Country_Name'][i] + ')'
    if Node0 != Node1:
        G.add_edge(Node0,Node1)
print('Nodes:', G.number_of_nodes())
print('Edges:', G.number_of_edges())
################################################################
sFileName=Base + '/' + Company + '/' + sOutputFileName1
print('################################')
print('Storing :',sFileName)
print('################################')
nx.write_gml(G, sFileName)
################################################################
sFileName=Base + '/' + Company + '/' + sOutputFileName2
print('################################')
print('Storing Graph Image:',sFileName)
print('################################')
plt.figure(figsize=(15, 15))
pos=nx.spectral_layout(G,dim=2)
nx.draw_networkx_nodes(G,pos, node_color='k', node_size=10, alpha=0.8)
nx.draw_networkx_edges(G, pos,edge_color='r', style='dashed')
nx.draw_networkx_labels(G,pos,font_size=12,font_family='sans-serif',font_color='b')
plt.axis('off')
plt.savefig(sFileName,dpi=600)
plt.show()
################################################################
print('################################')
print('### Done!! #####################')
print('################################')
Output:

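Because the routing graph is persisted as GML by the listing above, it can be reloaded and inspected later without rebuilding it. A minimal sketch, assuming the output path produced above:

import networkx as nx

# Reload the routing graph stored by the listing above
sFileName='C:/VKHCG/01-Vermeulen/05-Organise/01-EDS/02-Python/Organise-Network-Routing-Company.gml'
G2=nx.read_gml(sFileName)
# Quick sanity checks on the reloaded graph
print('Nodes:', G2.number_of_nodes())
print('Edges:', G2.number_of_edges())
# Five best-connected locations in the routing graph
print(sorted(G2.degree, key=lambda x: x[1], reverse=True)[:5])
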
F. Krennwallner --- Picking Content for Billboards

To enable the marketing salespeople to sell billboard content, they require a diagram showing which billboards connect to which office content publisher. Each of Krennwallner's billboards has a proximity sensor that records when a registered visitor points a smartphone at the billboard content or touches the near-field pad with a mobile phone. The following program builds an organized graph of the billboard location data, to help you gain insight into the billboard locations and the content-picking process.
C:\VKHCG\02-Krennwallner\05-Organise\Organise-billboards.py
################################################################
import sys
import os
import pandas as pd
import networkx as nx
import matplotlib.pyplot as plt
import numpy as np
################################################################
pd.options.mode.chained_assignment = None
################################################################
Base='C:/VKHCG'
################################################################
print('################################')
print('Working Base :',Base, ' using ', sys.platform)
print('################################')
################################################################
sInputFileName='02-Assess/01-EDS/02-Python/Assess-DE-Billboard-Visitor.csv'
################################################################
sOutputFileName1='05-Organise/01-EDS/02-Python/Organise-Billboards.gml'
sOutputFileName2='05-Organise/01-EDS/02-Python/Organise-Billboards.png'
Company='02-Krennwallner'
################################################################
################################################################
### Import Company Data
################################################################
sFileName=Base + '/' + Company + '/' + sInputFileName
print('################################')
print('Loading :',sFileName)
print('################################')
BillboardDataRaw=pd.read_csv(sFileName,header=0,low_memory=False, encoding="latin-1")
print('################################')
################################################################
print(BillboardDataRaw.head())
print(BillboardDataRaw.shape)
BillboardData=BillboardDataRaw
# Random sample of 20 row indices (np.random.choice samples with replacement)
sSample=list(np.random.choice(BillboardData.shape[0],20))
###############################################################
G=nx.Graph()
# Link every sampled billboard location to every other sampled location
for i in sSample:
    for j in sSample:
        Node0=BillboardData['BillboardPlaceName'][i] + '('+ BillboardData['BillboardCountry'][i] + ')'
        # Index j on both fields, so place and country come from the same row
        # (the original indexed the country with i here)
        Node1=BillboardData['BillboardPlaceName'][j] + '('+ BillboardData['BillboardCountry'][j] + ')'
        if Node0 != Node1:
            G.add_edge(Node0,Node1)
# Link each billboard's visitor-place label to its visitor-country label
for i in sSample:
    Node0=BillboardData['BillboardPlaceName'][i] + '('+ BillboardData['VisitorPlaceName'][i] + ')'
    Node1=BillboardData['BillboardPlaceName'][i] + '('+ BillboardData['VisitorCountry'][i] + ')'
    if Node0 != Node1:
        G.add_edge(Node0,Node1)
print('Nodes:', G.number_of_nodes())
print('Edges:', G.number_of_edges())
################################################################
sFileName=Base + '/' + Company + '/' + sOutputFileName1
print('################################')
print('Storing :',sFileName)
print('################################')
nx.write_gml(G, sFileName)
################################################################
sFileName=Base + '/' + Company + '/' + sOutputFileName2
print('################################')
print('Storing Graph Image:',sFileName)
print('################################')
plt.figure(figsize=(15, 15))
pos=nx.circular_layout(G,dim=2)
nx.draw_networkx_nodes(G,pos, node_color='k', node_size=150, alpha=0.8)
nx.draw_networkx_edges(G, pos,edge_color='r', arrows=False, style='solid')
nx.draw_networkx_labels(G,pos,font_size=12,font_family='sans-serif',font_color='b')
plt.axis('off')
plt.savefig(sFileName,dpi=600)
plt.show()
################################################################
print('################################')
print('### Done!! #####################')
print('################################')
Output:

G. Hillman --- Create a Delivery Route

Hillman requires a new delivery route plan for HQ-KA13's delivery region. The managing director has to know the following:
• What the most expensive route is, if the cost is £1.50 per mile and two trips are planned per day
• What the average travel distance in miles is for the region per 30-day month
With your newfound knowledge in building the technology stack for turning data lakes into business assets, can you convert the graph stored in the Assess step, called "Assess_Best_Logistics", into the shortest path between the two points? A sketch of that shortest-path step follows; the listing after it computes the cost and distance figures.
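A minimal sketch of the shortest-path step, assuming the Assess graph was saved with nx.write_gml() as Assess_Best_Logistics.gml and that 'WH-KA13' and 'HQ-KA13' are its node labels (the file path, labels, and edge attribute here are assumptions, not confirmed by the listing below):

import networkx as nx

# Assumed location of the graph stored during the Assess step
sGraphFile='C:/VKHCG/03-Hillman/02-Assess/01-EDS/02-Python/Assess_Best_Logistics.gml'
G=nx.read_gml(sGraphFile)
# Hop-count shortest path; pass weight='DistanceMiles' if that edge attribute exists
sPath=nx.shortest_path(G, source='WH-KA13', target='HQ-KA13')
print('Shortest route:', ' -> '.join(sPath))
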
C:\VKHCG\03-Hillman\05-Organise\Organise-Routes.py
# -*- coding: utf-8 -*-
################################################################
import sys
import os
import pandas as pd
################################################################
Base='C:/VKHCG'
################################################################
print('################################')
print('Working Base :',Base, ' using ', sys.platform)
print('################################')
################################################################
sInputFileName='02-Assess/01-EDS/02-Python/Assess_Shipping_Routes.txt'
################################################################
sOutputFileName='05-Organise/01-EDS/02-Python/Organise-Routes.csv'
Company='03-Hillman'
################################################################
################################################################
### Import Routes Data
################################################################
sFileName=Base + '/' + Company + '/' + sInputFileName
print('################################')
print('Loading :',sFileName)
print('################################')
RouteDataRaw=pd.read_csv(sFileName,header=0,low_memory=False, sep='|', encoding="latin-1")
print('################################')
################################################################
# Keep only the routes that start at warehouse WH-KA13
RouteStart=RouteDataRaw[RouteDataRaw['StartAt']=='WH-KA13']
################################################################
# Distance-in-miles measures, longest route first
RouteDistance=RouteStart[RouteStart['Cost']=='DistanceMiles']
RouteDistance=RouteDistance.sort_values(by=['Measure'], ascending=False)
################################################################
# Most expensive route per day: Measure scaled down by 1000,
# then £1.50 per mile times two trips per day
RouteMax=RouteStart["Measure"].max()
RouteMaxCost=round((((RouteMax/1000)*1.5*2)),2)
print('################################')
print('Maximum (£) per day:')
print(RouteMaxCost)
print('################################')
################################################################
# Average miles per 30-day month: two trips per day for 30 days
RouteMean=RouteStart["Measure"].mean()
RouteMeanMonth=round((((RouteMean/1000)*2*30)),6)
print('################################')
print('Mean per Month (Miles):')
print(RouteMeanMonth)
print('################################')
Output:

Practical 9
Aim: Generating Data

Vermeulen PLC
Vermeulen requires a map of all their customers' data links. Can you provide a report to deliver this? I will guide you through an example that delivers this requirement.
C:\VKHCG\01-Vermeulen\06-Report\Report-Network-Routing-Customer.py
################################################################
import sys
import os
import pandas as pd
import networkx as nx
import matplotlib.pyplot as plt
################################################################
pd.options.mode.chained_assignment = None
################################################################
if sys.platform == 'linux':
    Base=os.path.expanduser('~') + '/VKHCG'
else:
    Base='C:/VKHCG'
################################################################
print('################################')
print('Working Base :',Base, ' using ', sys.platform)
print('################################')
################################################################
sInputFileName='02-Assess/01-EDS/02-Python/Assess-Network-Routing-Customer.csv'
################################################################
sOutputFileName1='06-Report/01-EDS/02-Python/Report-Network-Routing-Customer.gml'
sOutputFileName2='06-Report/01-EDS/02-Python/Report-Network-Routing-Customer.png'
Company='01-Vermeulen'
################################################################
################################################################
### Import Customer Data
################################################################
sFileName=Base + '/' + Company + '/' + sInputFileName
print('################################')
print('Loading :',sFileName)
print('################################')
CustomerDataRaw=pd.read_csv(sFileName,header=0,low_memory=False, encoding="latin-1")
# Work with the first 100 customer rows only, to keep the graph readable
CustomerData=CustomerDataRaw.head(100)
print('Loaded Columns:',CustomerData.columns.values)
print('################################')
################################################################
print(CustomerData.head())
print(CustomerData.shape)
################################################################
G=nx.Graph()
# Link every customer country to every other customer country
for i in range(CustomerData.shape[0]):
    for j in range(CustomerData.shape[0]):
        Node0=CustomerData['Customer_Country_Name'][i]
        Node1=CustomerData['Customer_Country_Name'][j]
        if Node0 != Node1:
            G.add_edge(Node0,Node1)
# Link each country to its places, and each place to its coordinates
for i in range(CustomerData.shape[0]):
    Node0=CustomerData['Customer_Country_Name'][i]
    Node1=CustomerData['Customer_Place_Name'][i] + '('+ CustomerData['Customer_Country_Name'][i] + ')'
    Node2='('+ "{:.9f}".format(CustomerData['Customer_Latitude'][i]) + ')(' + \
          "{:.9f}".format(CustomerData['Customer_Longitude'][i]) + ')'
    if Node0 != Node1:
        G.add_edge(Node0,Node1)
    if Node1 != Node2:
        G.add_edge(Node1,Node2)
print('Nodes:', G.number_of_nodes())
print('Edges:', G.number_of_edges())
################################################################
sFileName=Base + '/' + Company + '/' + sOutputFileName1
print('################################')
print('Storing :',sFileName)
print('################################')
nx.write_gml(G, sFileName)
################################################################
sFileName=Base + '/' + Company + '/' + sOutputFileName2
print('################################')
print('Storing Graph Image:',sFileName)
print('################################')
plt.figure(figsize=(25, 25))
pos=nx.spectral_layout(G,dim=2)
nx.draw_networkx_nodes(G,pos, node_color='k', node_size=10, alpha=0.8)
nx.draw_networkx_edges(G, pos,edge_color='r', arrows=False, style='dashed')
nx.draw_networkx_labels(G,pos,font_size=12,font_family='sans-serif',font_color='b')
plt.axis('off')
plt.savefig(sFileName,dpi=600)
plt.show()
print('################################')
print('### Done!! #####################')
print('################################')

Output:

Krennwallner AG
The Krennwallner marketing department wants to deploy the locations of the billboards onto the company web server. Can you prepare three versions of the locations' web pages?
• Locations clustered into bubbles when you zoom out
• Locations as pins
• Locations as a heat map
C:\VKHCG\02-Krennwallner\06-Report\Report_Billboard.py
################################################################
# -*- coding: utf-8 -*-
################################################################
import sys
import os
import pandas as pd
from folium.plugins import FastMarkerCluster, HeatMap
from folium import Marker, Map
import webbrowser
################################################################
Base='C:/VKHCG'
print('################################')
print('Working Base :',Base, ' using ', sys.platform)
print('################################')
################################################################
sFileName=Base+'/02-Krennwallner/01-Retrieve/01-EDS/02-Python/Retrieve_DE_Billboard_Locations.csv'
df = pd.read_csv(sFileName,header=0,low_memory=False, encoding="latin-1")
df.fillna(value=0, inplace=True)
print(df.shape)
################################################################
# Build one list of [lat, lon] pairs for clustering and one of
# [lat, lon, description] pins, skipping rows without coordinates
t=0
for i in range(df.shape[0]):
    try:
        sLongitude=df["Longitude"][i]
        sLongitude=float(sLongitude)
    except Exception:
        sLongitude=float(0.0)
    try:
        sLatitude=df["Latitude"][i]
        sLatitude=float(sLatitude)
    except Exception:
        sLatitude=float(0.0)
    try:
        sDescription=df["Place_Name"][i] + ' (' + df["Country"][i]+')'
    except Exception:
        sDescription='VKHCG'
    if sLongitude != 0.0 and sLatitude != 0.0:
        DataClusterList=list([sLatitude, sLongitude])
        DataPointList=list([sLatitude, sLongitude, sDescription])
        t+=1
        if t==1:
            DataCluster=[DataClusterList]
            DataPoint=[DataPointList]
        else:
            DataCluster.append(DataClusterList)
            DataPoint.append(DataPointList)
data=DataCluster
pins=pd.DataFrame(DataPoint)
pins.columns = ['Latitude','Longitude','Description']
################################################################
# Version 1: locations clustered into bubbles when you zoom out
stops_map1 = Map(location=[48.1459806, 11.4985484], zoom_start=5)
marker_cluster = FastMarkerCluster(data).add_to(stops_map1)
sFileNameHtml=Base+'/02-Krennwallner/06-Report/01-EDS/02-Python/Billboard1.html'
stops_map1.save(sFileNameHtml)
webbrowser.open('file://' + os.path.realpath(sFileNameHtml))
################################################################
# Version 2: the first 100 locations as individual pins with pop-ups
stops_map2 = Map(location=[48.1459806, 11.4985484], zoom_start=5)
for name, row in pins.iloc[:100].iterrows():
    Marker([row["Latitude"],row["Longitude"]], popup=row["Description"]).add_to(stops_map2)
sFileNameHtml=Base+'/02-Krennwallner/06-Report/01-EDS/02-Python/Billboard2.html'
stops_map2.save(sFileNameHtml)
webbrowser.open('file://' + os.path.realpath(sFileNameHtml))
################################################################
# Version 3: the first 100 locations as a heat map
stops_heatmap = Map(location=[48.1459806, 11.4985484], zoom_start=5)
stops_heatmap.add_child(HeatMap([[row["Latitude"], row["Longitude"]] for name, row in pins.iloc[:100].iterrows()]))
sFileNameHtml=Base+'/02-Krennwallner/06-Report/01-EDS/02-Python/Billboard_heatmap.html'
stops_heatmap.save(sFileNameHtml)
webbrowser.open('file://' + os.path.realpath(sFileNameHtml))
################################################################
print('### Done!! ############################################')
################################################################
Output:

Graphics
This section guides you through a number of visualizations that are particularly useful in presenting data to my customers.
C:\VKHCG\01-Vermeulen\06-Report\Report_Graph_A.py
################################################################
# -*- coding: utf-8 -*-
################################################################
import sys
import os
import pandas as pd
import matplotlib as ml
from matplotlib import pyplot as plt
################################################################
if sys.platform == 'linux':
    Base=os.path.expanduser('~') + '/VKHCG'
else:
    Base='C:/VKHCG'
print('################################')
print('Working Base :',Base, ' using ', sys.platform)
print('################################')
################################################################
GBase = Base+'/01-Vermeulen/06-Report/01-EDS/02-Python/'
ml.style.use('ggplot')
# Sample profit data per warehouse location
data=[['London', 29.2, 17.4],['Glasgow', 18.8, 11.3],['Cape Town', 15.3, 9.0],['Houston', 22.0, 7.8],['Perth', 18.0, 23.7],['San Francisco', 11.4, 33.3]]
os_new=pd.DataFrame(data)
os_new.rename(columns = {0 : "Warehouse Location"}, inplace=True)
os_new.rename(columns = {1 : "Profit 2016"}, inplace=True)
os_new.rename(columns = {2 : "Profit 2017"}, inplace=True)
# Exploded pie graph of 2017 profit
explode = (0, 0, 0, 0, 0, 0.1)
labels=os_new['Warehouse Location']
colors_mine = ['yellowgreen', 'gold', 'lightskyblue', 'lightcoral', 'lightcyan', 'lightblue']
os_new.plot(figsize=(10,10),kind="pie",y="Profit 2017",autopct='%.2f%%',
            shadow=True, explode=explode, legend=False, colors=colors_mine,
            labels=labels, fontsize=20)
sPicNameOut1=GBase+'pie_explode.png'
plt.tight_layout()
plt.savefig(sPicNameOut1,dpi=600)
# Double pie graph: 2016 and 2017 side by side
explode = (0, 0, 0, 0, 0, 0)
os_new.plot(figsize=(10,5),kind="pie",y=['Profit 2016','Profit 2017'],autopct='%.2f%%',
            shadow=True, explode=explode, legend=False, colors=colors_mine,
            subplots=True, labels=labels, fontsize=10)
sPicNameOut2=GBase+'pie.png'
plt.tight_layout()
plt.savefig(sPicNameOut2,dpi=600)
# Line graph
os_new.iloc[:5].plot(figsize=(10, 10),kind='line',x='Warehouse Location',
                     y=['Profit 2016','Profit 2017'])
sPicNameOut3=GBase+'line.png'
plt.tight_layout()
plt.savefig(sPicNameOut3,dpi=600)
# Bar graph
os_new.iloc[:5].plot(figsize=(10, 10),kind='bar',x='Warehouse Location',
                     y=['Profit 2016','Profit 2017'])
sPicNameOut4=GBase+'bar.png'
plt.tight_layout()
plt.savefig(sPicNameOut4,dpi=600)
# Horizontal bar graph
os_new.iloc[:5].plot(figsize=(10, 10),kind='barh',x='Warehouse Location',
                     y=['Profit 2016','Profit 2017'])
sPicNameOut5=GBase+'hbar.png'
plt.tight_layout()
plt.savefig(sPicNameOut5,dpi=600)
# Area graph (unstacked)
os_new.iloc[:5].plot(figsize=(10,10),kind='area',x='Warehouse Location',
                     y=['Profit 2016','Profit 2017'],stacked=False)
sPicNameOut6=GBase+'area.png'
plt.tight_layout()
plt.savefig(sPicNameOut6,dpi=600)
# Scatter graph (the original line was garbled; marker='D' for diamonds assumed)
os_new.iloc[:5].plot(figsize=(10,10),kind='scatter',x='Profit 2016',
                     y='Profit 2017',color='DarkBlue',marker='D')
sPicNameOut7=GBase+'scatter.png'
plt.tight_layout()
plt.savefig(sPicNameOut7,dpi=600)
# Hexbin graph
os_new.iloc[:5].plot(figsize=(13, 10),kind='hexbin',x='Profit 2016',
                     y='Profit 2017', gridsize=25)
sPicNameOut8=GBase+'hexbin.png'
plt.tight_layout()
plt.savefig(sPicNameOut8,dpi=600)
Output:

Pie Graph -- Double Pie

Line Graph

Bar Graph / Horizontal Bar Graph

Area Graph

Scatter Graph

Kernel Density Estimation (KDE) Graph
C:\VKHCG\01-Vermeulen\06-Report\Report_Graph_B.py

################################################################
# -*- coding: utf-8 -*-
################################################################
import sys
import os
import pandas as pd
import matplotlib as ml
import numpy as np
from matplotlib import pyplot as plt
from pandas.plotting import scatter_matrix
################################################################
if sys.platform == 'linux':
    Base=os.path.expanduser('~') + '/VKHCG'
else:
    Base='C:/VKHCG'
print('################################')
print('Working Base :',Base, ' using ', sys.platform)
print('################################')
################################################################
ml.style.use('ggplot')
# KDE of 1,000 random normal samples
fig1=plt.figure(figsize=(10, 10))
ser = pd.Series(np.random.randn(1000))
ser.plot(figsize=(10, 10),kind='kde')
sPicNameOut1=Base+'/01-Vermeulen/06-Report/01-EDS/02-Python/kde.png'
plt.tight_layout()
plt.savefig(sPicNameOut1,dpi=600)
plt.show()
# Scatter matrix of five random series, with KDE plots on the diagonal
fig2=plt.figure(figsize=(10, 10))
df = pd.DataFrame(np.random.randn(1000, 5), columns=['Y2014','Y2015', 'Y2016', 'Y2017', 'Y2018'])
scatter_matrix(df, alpha=0.2, figsize=(10, 10), diagonal='kde')
sPicNameOut2=Base+'/01-Vermeulen/06-Report/01-EDS/02-Python/scatter_matrix.png'
plt.tight_layout()
plt.savefig(sPicNameOut2,dpi=600)
plt.show()
Output:

Practical 10

Aim: Data Visualization with Power BI
Case Study: Sales Data
Step 1: Connect to an Excel workbook

You can also open the Query Editor by selecting Edit Queries from the Home ribbon in Power BI Desktop. The following steps are performed in the Query Editor.

In Query Editor, select the ProductID, ProductName, QuantityPerUnit, and UnitsInStock columns (use Ctrl+Click to select more than one column, or Shift+Click to select columns that are beside each other).

Step 2: Select Remove Columns > Remove Other Columns from the ribbon, or right-click on a column header and click Remove Other Columns.

Step 3: Change the data type of the UnitsInStock column
For the Excel workbook, products in stock will always be a whole number, so in this step you confirm that the UnitsInStock column's data type is Whole Number.
1. Select the UnitsInStock column.
2. Select the Data Type drop-down button in the Home ribbon.
3. If it is not already Whole Number, select Whole Number from the drop-down (the Data Type button also displays the data type of the current selection).
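For readers mirroring these steps in Python, a rough pandas analogue of the whole-number type check (the frame and its values here are invented for illustration):

import pandas as pd

# Hypothetical slice of the products workbook
products = pd.DataFrame({'ProductID': [1, 2], 'UnitsInStock': ['39', '17']})
# Force UnitsInStock to a whole-number (nullable integer) dtype
products['UnitsInStock'] = pd.to_numeric(products['UnitsInStock']).astype('Int64')
print(products.dtypes)
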
Task 2: Import order data from an OData feed

You import data into Power BI Desktop from the sample Northwind OData feed at the following URL, which you can copy (and then paste) in the steps below:
http://services.odata.org/V3/Northwind/Northwind.svc/

Step 1: Connect to an OData feed

1. From the Home ribbon tab in Query Editor, select Get Data.
2. Browse to the OData Feed data source.
3. In the OData Feed dialog box, paste the URL for the Northwind OData feed.
4. Select OK.
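As an aside, the same public feed can also be read from Python. A minimal sketch using requests (the $format=json option and the 'value' wrapper follow OData V3 conventions, and the Orders column names follow the Northwind schema; treat the exact payload shape as an assumption):

import pandas as pd
import requests

# Public Northwind OData V3 feed used in this practical; fetch 10 orders as JSON
sUrl='http://services.odata.org/V3/Northwind/Northwind.svc/Orders?$format=json&$top=10'
payload=requests.get(sUrl).json()
# OData V3 JSON responses wrap the rows in a 'value' list
orders=pd.DataFrame(payload['value'])
print(orders[['OrderID','ShipCity','ShipCountry']].head())
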
Step 2: Expand the Order_Details table

Expand the Order_Details table that is related to the Orders table, to combine the ProductID, UnitPrice, and Quantity columns from Order_Details into the Orders table.
The Expand operation combines columns from a related table into a subject table. When the query runs, rows from the related table (Order_Details) are combined into rows from the subject table (Orders).

After you expand the Order_Details table, three new columns and additional rows are added to the Orders table, one for each row in the nested or related table. A rough pandas analogue of the operation is sketched after the steps below.
1. In the Query View, scroll to the Order_Details column.
2. In the Order_Details column, select the expand icon.
3. In the Expand drop-down:
   a. Select (Select All Columns) to clear all columns.
   b. Select ProductID, UnitPrice, and Quantity.
   c. Click OK.
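For reference, a rough pandas analogue of the Expand operation (a left join from Orders into Order_Details; the miniature frames and their values are assumptions for illustration):

import pandas as pd

# Hypothetical miniature versions of the two Northwind tables
orders = pd.DataFrame({'OrderID': [10248, 10249], 'ShipCity': ['Reims', 'Münster']})
order_details = pd.DataFrame({'OrderID': [10248, 10248, 10249],
                              'ProductID': [11, 42, 14],
                              'UnitPrice': [14.0, 9.8, 18.6],
                              'Quantity': [12, 10, 9]})
# Expand: each Orders row is repeated once per matching Order_Details row
expanded = orders.merge(order_details, on='OrderID', how='left')
print(expanded)
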
Step 3: Remove other columns to display only columns of interest

In this step, you remove all columns except OrderDate, ShipCity, ShipCountry, Order_Details.ProductID, Order_Details.UnitPrice, and Order_Details.Quantity. In the previous task, you used Remove Other Columns; for this task, you remove selected columns.
1. In the Query View, select all columns:
   a. Click the first column (OrderID).
   b. Shift+Click the last column (Shipper).
2. Now that all columns are selected, use Ctrl+Click to unselect the following columns: OrderDate, ShipCity, ShipCountry, Order_Details.ProductID, Order_Details.UnitPrice, and Order_Details.Quantity.
3. Now that only the columns you want to remove are selected, right-click on any selected column header and click Remove Columns.

Step 4: Calculate the line total for each Order_Details row
Power BI Desktop lets you create calculations based on the columns you are importing, so you can enrich the data that you connect to. In this step, you create a Custom Column to calculate the line total for each Order_Details row:

1. In the Add Column ribbon tab, click Add Custom Column.
2. In the Add Custom Column dialog box, in the Custom Column Formula textbox, enter [Order_Details.UnitPrice] * [Order_Details.Quantity].
3. In the New column name textbox, enter LineTotal. (A pandas sketch of the same calculation follows.)
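A rough pandas analogue of the LineTotal custom column (the frame and its values are assumptions, mirroring the expanded order rows from the earlier sketch):

import pandas as pd

# Hypothetical expanded order rows with the prefixed column names
expanded = pd.DataFrame({'OrderID': [10248, 10248],
                         'Order_Details.UnitPrice': [14.0, 9.8],
                         'Order_Details.Quantity': [12, 10]})
# Same formula as the Custom Column: UnitPrice * Quantity
expanded['LineTotal'] = (expanded['Order_Details.UnitPrice']
                         * expanded['Order_Details.Quantity'])
print(expanded)
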
Step 5: Set the data type of the LineTotal field

1. Right-click the LineTotal column.
2. Select Change Type and choose Decimal Number.

Step 6: Rename and reorder columns in the query

1. In Query Editor, drag the LineTotal column to the left, after ShipCountry.
2. Remove the Order_Details. prefix from the Order_Details.ProductID, Order_Details.UnitPrice, and Order_Details.Quantity columns by double-clicking on each column header and then deleting that text from the column name.

Task 3: Combine the Products and Total Sales queries

2. Power BI Desktop loads the data from the two queries.
3. Once the data is loaded, select the Manage Relationships button on the Home ribbon.
4. Select the New… button.
5. When we attempt to create the relationship, we see that one already exists! As shown in the Create Relationship dialog (by the shaded columns), the ProductID fields in each query already have an established relationship.
6. Select Cancel, and then select Relationship view in Power BI Desktop.

Task 4: Build visuals using your data

Step 1: Create charts showing Units in Stock by Product and Total Sales by Year

Step 2: Next, drag ShipCountry to a space on the canvas in the top right. Because you selected a geographic field, a map is created automatically. Now drag LineTotal to the Values field; the circles on the map for each country are now sized relative to the LineTotal of orders shipped to that country.

Step 3: Interact with your report visuals to analyze further.