{% extends "../base.html" %}


{% block style %}
<link rel="stylesheet" type="text/css" href="/media/stylesheets/helper.css" />
{% endblock %}

{% block container %}
<div class="main_container">
    {% include '../helper_sidebar.html' %}

    <div class="info_container">
        <div id="main" style="padding: 15px;">
            <p>Over the years, data has evolved into a crucial element of business decision-making. Advances in technology have revolutionized the ways data is collected, stored, analyzed, and utilized.</p>
            <p><b>Data Katalyst</b> comprises multiple components that independently streamline these processes. It is designed around a No/Low code concept, simplifying the development process. Users configure the required input, which auto-generates and executes the necessary code using Airflow for data analysis.</p>
            <p>The application relies heavily on the configuration of its various components. This foundational process ensures that code is generated based on the set configurations.</p>
        </div>
        <div id="connection" style="padding: 15px;">
            <h3>1. Settings</h3>
            <p>Settings allow users to connect to external data sources. These settings can be shared across all projects, or with specific ones, using the ‘Share’ option.
                Below are the parameters that can be configured.
            </p>

            <p><b>Connection: </b>
                Users can connect to the following external sources:
            </p>
            <p><i>Local Store: </i>The default selection under Type. Users should provide the local directory path where the source files are stored.</p>
            <p class="p1">Name: Field to name the connection.</p>
            <p>Path: Field to specify the directory path.</p>
            <p><i>HDFS: </i>
                For files stored in HDFS, select the ‘HDFS’ option under Type and configure the following fields:
            </p>
            <p class="p1">Name: Field to specify the connection name.</p>
            <p>URL: Field to provide the HDFS URL.</p>

            <p><i>Amazon S3: </i>
                For files stored in an S3 bucket, select the ‘Amazon S3’ option and configure the following fields:
            </p>
            <p class="p1">Name: Field to specify the connection name.</p>
            <p>URL: Field to provide the S3 URL.</p>
            <p><i>DB connections: </i>
                To read data from a database, select the ‘Database’ option and configure the following fields:
            </p>
            <div class="para">
                <p>Name: Field to specify the connection name.</p>
                <p>URL: Field to specify the JDBC connection URL.</p>
                <p>Database Name: Field to specify the database name.</p>
                <p>Port: Field to specify the port if it is not already included in the JDBC connection URL.</p>
            </div>
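            <p>For illustration, a JDBC connection URL typically encodes the host, port, and database name; the values below are hypothetical placeholders for a PostgreSQL source:</p>
            <pre>
Name:          sales_db
URL:           jdbc:postgresql://db.example.com:5432/sales
Database Name: sales
Port:          5432    # needed only if the port is not already part of the URL
            </pre>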
            <p><b>Pandas: </b>
                Options to control Pandas settings.
            </p>
            <p><b>Environment Variables: </b>
                Options to set environment variables.
            </p>
            <p><b>Kafka Streaming: </b>
                Options to set Kafka-related configurations, such as the topic.
            </p>
            <p><b>General: </b>
                Configurations applicable to all the data nodes are set here.
            </p>
            <p><b>Spark: </b>
                Keys applicable to the Spark application, such as:
                'spark.executor.memory'
                'spark.app.name'
                'spark.driver.memory'
            </p>
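            <p>In a generated Spark job, these keys become standard Spark properties. A minimal sketch of how such keys would map onto a PySpark session (the values shown are placeholders):</p>
            <pre>
from pyspark.sql import SparkSession

# Each configured key/value pair becomes a property on the Spark session.
spark = (
    SparkSession.builder
    .config("spark.app.name", "data_katalyst_job")  # placeholder value
    .config("spark.executor.memory", "4g")          # placeholder value
    .config("spark.driver.memory", "2g")            # placeholder value
    .getOrCreate()
)
            </pre>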
        </div>
        <div id="datanode" style="padding: 15px;">
            <h3>2. Data Node</h3>
            <p>A Data Node holds the data structure of a table and can be categorized as Input, Output, Reference, Logical DataNode, or Intermediary against a connection name defined in the Connection settings.</p>
            <p><b>Subject Area: </b>
                Free text field to specify a name and group identical data nodes.</p>
            <p><b>Name: </b>
                Indicates where the data is read from: the table name if the connection is to a database, or the file name (including the extension) if the connection is to a source such as Local Store, HDFS, or Amazon S3.</p>
            <p><b>Source Category: </b>
                Indicates the type of data node that stores the structure of the columns:</p>
            <p>Input: Treated as a source from which the data will be read.</p>
            <p>Output: Used in a transformation to store the target data.</p>
            <p>Reference: Indicates tables that can be used as references, such as country codes, currency codes, etc.</p>
            <p>Intermediary: Typically used to store transformation data between Input and Output.</p>
            <p>Logical DataNode: Used to create a view-like data node, for example by joining two data nodes into a data node in itself.</p>
            <p><b>Data Node Alias: </b>
                An alternate name for the Data Node, used throughout the application.</p>
            <p><b>Connection Name: </b>
                A list of names from the Connections configuration, where the base path is set, allowing data retrieval.</p>
            <p>For each Data Node, users can specify a schema and parameters:</p>
            <p><b>Schema: </b>
                Defines the properties of a data node (see the example after this list).</p>
            <ul>
                <li><i>Column Name: </i>Indicates the name of a column in the data node.</li>
                <li><i>Roll Up: </i>Specifies the type of aggregation function to be performed when a GroupBy operation is applied to a column.</li>
                <li><i>Foreign Key: </i>Specifies a foreign key relationship in the format ‘datanode.columnname’ if a column of the selected data node is a foreign key in another data node.</li>
                <li><i>Mandatory: </i>Marks a column as mandatory if it should not be null.</li>
                <li><i>Primary Key: </i>Specifies the primary key of the data node in the format ‘{Key}’.</li>
                <li><i>Alternate Key: </i>Specifies an alternate primary key for the data node in the format ‘{Key1, Key2}’.</li>
            </ul>
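            <p>A hypothetical schema entry for an ‘orders’ data node using the formats described above (all names and values are illustrative only):</p>
            <pre>
Column Name:   customer_id
Roll Up:       sum                          # aggregation applied on GroupBy
Foreign Key:   customer.customer_id         # format: datanode.columnname
Mandatory:     Yes
Primary Key:   {order_id}                   # format: {Key}
Alternate Key: {order_date, customer_id}    # format: {Key1, Key2}
            </pre>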
            <p><b>Parameters: </b>
                Values specific to a data node, such as business date, weightage, etc., can be specified here.</p>

        </div>
        <div id="flows" style="padding: 15px;">
            <h3>3. Flows</h3>
            <p>Flows define how data is transferred from one point to another, based on the filters and aggregation functions that may be applied to the data node.</p>
            <p><b>Flow Type: </b>
                Selection field to specify whether the flow contains aggregated or non-aggregated columns.</p>
            <p><b>Flow Tag: </b>
                Free text field to specify a name and group identical flows.</p>
            <p><b>Source: </b>
                Indicates the data node on which the filters and aggregations are to be performed.</p>
            <p><b>Target: </b>
                Indicates the data node where the data will be placed post transformation.</p>
            <p><b>Aggregate Columns: </b>
                Specifies the columns on which the GroupBy function is to be applied.</p>
            <p><b>Pre Aggregate Condition: </b>
                Specifies the filter to be applied to the source data node before the aggregation function.</p>
            <p><b>Post Aggregate Condition: </b>
                Specifies the filter to be applied to the target data node after the aggregation function.</p>
            <p><b>Write Mode: </b>
                Option to select whether the output data is appended to the same file or overrides the existing file.</p>
            <p><b>Column Mapping: </b>
                Option to map the columns of the source to the corresponding columns of the target.</p>
            <p><b>Target Column Name: </b>
                List of columns available in the target data node.</p>
            <p><b>Target Column Expression: </b>
                The aggregation expression to be performed on the source column, e.g. sum('Col1').
                Users can also copy a column from the source to the target as-is, without adding any expression, e.g. col('Col1').</p>
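            <p>Conceptually, a configured flow of the aggregate type translates into a Spark transformation along these lines; the paths, data node, and column names below are hypothetical, not actual generated output:</p>
            <pre>
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, sum as sum_

spark = SparkSession.builder.getOrCreate()

# Hypothetical source; the real path comes from the Connection settings.
source = spark.read.csv("/data/input/sales.csv", header=True, inferSchema=True)

target = (
    source
    .filter(col("status") == "ACTIVE")           # Pre Aggregate Condition
    .groupBy("region")                           # Aggregate Columns
    .agg(sum_(col("Col1")).alias("Col1_total"))  # Target Column Expression: sum('Col1')
    .filter(col("Col1_total") > 0)               # Post Aggregate Condition
)

target.write.mode("overwrite").csv("/data/output/sales_by_region")  # Write Mode
            </pre>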
        </div>
        <div id="common-validation" style="padding: 15px;">
            <h3>4. DQ Validations</h3>
            <p>List of steps involved in the Data Quality process to verify and validate the data coming from the
                source table. Below is the list of operations available to perform a Data Quality check.
            </p>
            <ul>
                <li>Common Validation</li>
                <li>Business Validation</li>
                <li>Data Recon</li>
                <li>Data Profile</li>
            </ul>
            <p><b>Common Validation: </b>
                A library of validations, universal across the data collection, used to verify that a column’s data is
                in an expected format. The system has a predefined set of validations that can be applied to a column
                of a data node, e.g. Mobile Number, Email, Country Code.</p>
            <p>The following details are required to create a user-defined validation.</p>
            <p><i> Validation Name: </i>
                Indicates the name of the validation being created.</p>
            <p><i> Condition: </i>
                Specifies the condition according to the data pattern expected.
                Multiple validations can also be combined using AND / OR operators. Click Add to specify the
                operators and Save to save the validation.
            </p>
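            <p>As a sketch, assuming the generated checks run as PySpark column expressions (the tool’s actual condition grammar is not documented here), an Email validation combined with a non-null check could look like:</p>
            <pre>
from pyspark.sql.functions import col

# Hypothetical condition: email must be present AND match a basic pattern.
condition = col("email").isNotNull() & col("email").rlike(r"^[\w.+-]+@[\w-]+\.[A-Za-z]{2,}$")
            </pre>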
            <p><b>Business Validation: </b>
                Verifies whether the data is in an expected format catering to business needs.
                The following details are required to add a business validation.
            </p>
            <p><i> Validation Name: </i>
                Free text field to define a name for the validation being created.</p>
            <p><i> Table Alias: </i>
                Select the data node from the dropdown to which the business validation is to be applied.</p>
            <p><i> Column Name: </i>
                List of columns available in the selected data node to which the validation will be applied.</p>
            <p><i> Validation Types: </i>
                Select the type of validation to be performed on the column chosen in the step above.</p>
            <p><i> Execution Tag: </i>
                Select a tag from the list to classify the data quality category for the validation being created,
                which will be reflected in the dashboards.</p>
            <p><i> Failure Condition: </i>
                Defines a condition for a column; if the data fails to meet the specified validation criteria,
                the system logs the failure. This option is available when the Validation Type is Business Value.</p>
            <p>The following fields are required to check the Consistency of a data node:</p>
            <p><i> Source Key: </i>
                Specifies a key whose values will be used as a foreign key in the reference table.</p>
            <p><i> Reference Table Alias: </i>
                Select the data node to be referenced to check the consistency of a column.</p>
            <p><i> Reference Column: </i>
                Specifies the column from the reference table to be verified against the Column Name specified in the previous step.</p>
            <p><i> Foreign Key: </i>
                The Foreign Key in Consistency ensures that all foreign key values in a data node match the Source Key values in the
                other data node.</p>
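            <p>Conceptually, such a consistency check is a referential-integrity test: every foreign key value in the source must exist in the reference data node. A minimal PySpark sketch, with hypothetical data node and column names:</p>
            <pre>
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Hypothetical data nodes; in practice these are read via the configured connections.
orders = spark.createDataFrame([(1, "C1"), (2, "C7")], ["order_id", "customer_id"])
customer = spark.createDataFrame([("C1",), ("C2",)], ["customer_id"])

# Rows whose customer_id has no match in the reference data node fail the check.
orphans = orders.join(customer, on="customer_id", how="left_anti")
print(orphans.count())  # -> 1 (the order pointing at C7)
            </pre>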
            <p><b> Data Recon</b></p>
            <p>Verifies whether the correct set of data is flowing from the source to the destination data node.</p>

            <p><i> Flow Number: </i>
                Select the flow created in the Flows section.</p>
            <p><i> Source Aggregated Column: </i>
                Select the column from the source data node on which the ‘GroupBy’ aggregation is to be performed.</p>
            <p><i> Target Aggregated Column: </i>
                Select the column from the target data node on which the ‘GroupBy’ aggregation is to be performed.</p>
            <p><i> Target Filter: </i>
                Specifies the filter to be applied to the target data node after the data is moved from the source.</p>
            <p><i> Source Column Name: </i>
                If a column name differs between source and target, select the column from the source data node that has to be mapped to the target data node.</p>
            <p><i> Target Column Name: </i>
                If a column name differs between source and target, select the column from the target data node that has to be mapped to the source column selected
                above.</p>
            <p><i> Recon Measures: </i>
                List of aggregation functions to be performed on the source column.</p>
            <p><i> Recon Tag: </i>
                Free text field to group a set of identical data flows.</p>
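            <p>In essence, a recon computes the same aggregate on both ends of a flow and compares the results. A PySpark sketch of the idea, with hypothetical data and column names:</p>
            <pre>
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

# Hypothetical source and target data nodes for one flow.
source_df = spark.createDataFrame([("EU", 10.0), ("US", 5.0)], ["region", "amount"])
target_df = spark.createDataFrame([("EU", 10.0), ("US", 4.0)], ["region", "amount"])

# Recon Measure 'sum' computed on both ends of the flow.
src = source_df.groupBy("region").agg(F.sum("amount").alias("src_total"))
tgt = target_df.groupBy("region").agg(F.sum("amount").alias("tgt_total"))

# Regions where the totals disagree indicate data lost or altered in flight.
mismatches = src.join(tgt, on="region", how="full_outer") \
                .where(~F.col("src_total").eqNullSafe(F.col("tgt_total")))
mismatches.show()  # -> the US row
            </pre>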
            <p><b> Data Profile</b></p>
            <p>The process of examining, analyzing, and creating useful summaries of data. It yields a
                high-level overview which aids in the discovery of data quality issues, risks, and overall trends.
            </p>
            <p><i> Data Node Alias: </i>
                Select the data node from the list on which the profiling is to be done.</p>

            <p><i> Profile Tag: </i>
                Free text field to group a set of identical data profiles.</p>

            <p><i> Group By: </i>
                Select a column, or a set of columns, from the list on which GroupBy is to be applied.</p>

            <p><i> Column Name: </i>
                Choose the column from the dropdown on which the profiling is to be done.</p>

            <p><i> Profile Measures: </i>
                Select all, or any, of the profile measures based on the type of analysis to be performed, helping the
                organization in decision making.</p>
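            <p>For intuition, the kind of summary a profile produces can be approximated in PySpark with grouped aggregates; the measures shown are illustrative, not the tool’s full list:</p>
            <pre>
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([("EU", 10.0), ("EU", 30.0), ("US", 5.0)], ["region", "amount"])

# Profile 'amount' grouped by 'region' with a few common measures.
profile = df.groupBy("region").agg(
    F.count("amount").alias("count"),
    F.min("amount").alias("min"),
    F.max("amount").alias("max"),
    F.avg("amount").alias("avg"),
    F.countDistinct("amount").alias("distinct"),
)
profile.show()
            </pre>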
        </div>
        <div id="data-orchestration" style="padding: 15px;">
            <h3>5. Data Orchestration</h3>
            <p>Data Orchestration is the process of automating, managing, and coordinating the movement and transformation of data across various systems and applications. It ensures that data flows seamlessly from source to destination, enabling efficient data integration and analysis.
                Data Katalyst Orchestration is built on Airflow, which provides the infrastructure and capabilities to design, deploy, and manage complex data workflows, enabling organizations to derive insights and make data-driven decisions more effectively.
            </p>
            <p>The operations which can be used to create a DAG are as follows:</p>
            <ul>
                <li>DQ Steps</li>
                <li>Flows</li>
                <li>Rule Engine</li>
            </ul>

            <p><b> DQ Steps</b></p>
            <p>Operation: Configure the DQ operation to be performed from the dropdown.</p>
            <p>DataNodes: A list of data nodes from which the user selects the ones on which to test the DQ.</p>
            <p>Source Prefix: The prefix to be added to the generated code filename.</p>
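            <p>Since the orchestration layer is Airflow, each configured operation ultimately becomes a task in a generated DAG. Below is a minimal Airflow 2-style sketch of what such a DAG might look like; the DAG id, task names, callables, and schedule are hypothetical, not the tool’s actual output:</p>
            <pre>
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator


def run_dq_steps():
    pass  # placeholder for the generated DQ validation code


def run_flow():
    pass  # placeholder for the generated flow/transformation code


with DAG(
    dag_id="data_katalyst_example",  # hypothetical DAG id
    start_date=datetime(2024, 1, 1),
    schedule="@daily",
    catchup=False,
):
    dq = PythonOperator(task_id="dq_steps", python_callable=run_dq_steps)
    flow = PythonOperator(task_id="flows", python_callable=run_flow)

    dq >> flow  # run the data quality checks before the flow
            </pre>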
        </div>
        <div id="dashboard" style="padding: 15px;">
            <h3>6. Output & Reports</h3>

            <h5>Overview Section</h5>

            <p>Summary Metrics: High-level metrics such as the overall data quality
                score, the number of records processed, and the percentage of data passing validation checks.</p>
            <p>Trends: Visualizations showing how data quality metrics have changed over time.</p>
            <p><b>DQ Score: </b>
                An overview of the total rows passed across the data nodes within a project, in percentage format.
                The score does not double-count duplicate failures; for example, a row failing both schema and business validation is counted as a single failure.</p>
            <p>The score varies according to the weightage allocated at each Data Node level or at the Parameters level. If no weightage is defined,
                100% is taken by default to calculate the score.
                The six data quality dimensions are Accuracy, Completeness, Consistency, Uniqueness, Timeliness, and Validity.
            </p>
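            <p>A small worked example of how such a weighted score could be computed; the weighting formula below is an assumption based on the description above, not a confirmed specification:</p>
            <pre>
# Hypothetical per-data-node pass rates and weightages.
nodes = [
    {"name": "orders",   "passed": 950, "total": 1000, "weight": 0.7},
    {"name": "customer", "passed": 180, "total": 200,  "weight": 0.3},
]

# Weighted average of pass rates; with no weightage defined, each node
# would default to full (100%) weighting per the description above.
score = sum(n["passed"] / n["total"] * n["weight"] for n in nodes) * 100
print(f"DQ Score: {score:.1f}%")  # -> DQ Score: 93.5%
            </pre>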
            <h5>Data Quality Dimensions</h5>
            <p><b>Completeness: </b>
                Requires that all the required records and values are available, with no missing information; it measures what information is missing.
                For example, consider the address field on a membership form: if three forms out of 100 are missing an address, the address data is 97% complete.
                This is measured based on the primary key and business validation configurations.
            </p>
            <p><b>Validity: </b>
                Measures the data failing a set of patterns that match real-world entities,
                such as phone number, currency, and email ID patterns. This is achieved through the patterns defined in
                Common Validation.</p>
            <p><b>Uniqueness: </b>
                Designed to prevent the same data from being stored multiple times. When data is unique,
                no record exists more than once within a table: each record can be uniquely identified, with no redundant
                storage. This is achieved through the primary key and alternate key configurations set in the data node schema, as sketched below.</p>
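            <p>A PySpark sketch of a uniqueness check over a primary key (the data node and key names are hypothetical):</p>
            <pre>
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()
orders = spark.createDataFrame([(1, "a"), (2, "b"), (2, "c")], ["order_id", "item"])

# Any primary key value occurring more than once violates Uniqueness.
duplicates = orders.groupBy("order_id").count().where(F.col("count") > 1)
duplicates.show()  # -> order_id 2 appears twice
            </pre>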
            <p><b>Timeliness: </b>
                Measures the SLA (Service-Level Agreement) of the code execution against the SLA configured in Parameters. It is a boolean
                value: if the value is Yes, the execution finished within the SLA; otherwise it exceeded the time configured in the system.</p>
            <p><b>Consistency: </b>
                Consistency means the data across all systems reflects the same information and is in sync across the enterprise.
                It is measured by comparing the data between data nodes against the foreign key and primary key combination configured in business validation.</p>
            <p><b>Accuracy: </b>
                Measures the foreign key relation configured in the schema set at the Data Node.</p>
        </div>
    </div>
</div>

{% endblock %}