{% extends "../base.html" %}


{% block style %}
<link rel="stylesheet" type="text/css" href="/media/stylesheets/helper.css" />
{% endblock %}

{% block container %}
<div class="main_container">
    {% include '../helper_sidebar.html' %}

    <div class="info_container">
        <div id="main" style="padding: 15px;">
            <p>Over the years, data has evolved into a crucial element of business decision-making. Advances in technology have revolutionized the ways data is collected, stored, analyzed, and utilized.</p>
            <p><b>Data Katalyst</b> comprises multiple components that independently streamline these processes. It is designed around a No/Low code concept, simplifying the development process. Users configure the required input, which auto-generates and executes the necessary code using Airflow for data analysis.</p>
            <p>The application relies heavily on the configuration of its various components. This foundational process ensures that code is generated based on the set configurations.</p>
        </div>
        <div id="connection" style="padding: 15px;">
            <h3>1. Settings</h3>
            <p>Settings allow users to connect to external data sources. These settings can be shared across all projects, or with specific ones, using the ‘Share’ option.
                Below are the parameters that can be configured.
            </p>

            <p><b>Connection: </b>
                Users can connect to the following external sources:
            </p>
            <p><i>Local Store: </i>The default selection under Type. Users should provide the local directory path where the source files are stored.</p>
            <p class="p1">Name: Field to name the connection.</p>
            <p>Path: Field to specify the directory path.</p>
            <p><i>HDFS: </i>
                For files stored in HDFS, select the ‘HDFS’ option under Type and configure the following fields:
            </p>
            <p class="p1">Name: Field to specify the connection name.</p>
            <p>URL: Field to provide the HDFS URL.</p>

            <p><i>Amazon S3: </i>
                For files stored in an S3 bucket, select the ‘Amazon S3’ option and configure the following fields:
            </p>
            <p class="p1">Name: Field to specify the connection name.</p>
            <p>URL: Field to provide the S3 URL.</p>
            <p><i>DB connections: </i>
                To read data from a database, select the ‘Database’ option and configure the following fields:
            </p>
            <div class="para">
                <p>Name: Field to specify the connection name.</p>
                <p>URL: Field to specify the JDBC connection URL.</p>
                <p>Database Name: Field to specify the database name.</p>
                <p>Port: Field to specify the port if it is not already included in the JDBC connection URL.</p>
            </div>
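            <p>For illustration, a JDBC connection URL typically encodes the host, port, and database name; the values below are hypothetical placeholders for a PostgreSQL source:</p>
            <pre>
Name:          sales_db
URL:           jdbc:postgresql://db.example.com:5432/sales
Database Name: sales
Port:          5432    # needed only if the port is not already part of the URL
            </pre>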
            <p><b>Pandas: </b>
                Options to control Pandas settings.
            </p>
            <p><b>Environment Variables: </b>
                Options to set environment variables.
            </p>
            <p><b>Kafka Streaming: </b>
                Options to set Kafka-related configurations, such as the topic.
            </p>
            <p><b>General: </b>
                Configurations applicable to all the data nodes are set here.
            </p>
            <p><b>Spark: </b>
                Keys applicable to the Spark application, such as:
                'spark.executor.memory'
                'spark.app.name'
                'spark.driver.memory'
            </p>
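            <p>In a generated Spark job, these keys become standard Spark properties. A minimal sketch of how such keys would map onto a PySpark session (the values shown are placeholders):</p>
            <pre>
from pyspark.sql import SparkSession

# Each configured key/value pair becomes a property on the Spark session.
spark = (
    SparkSession.builder
    .config("spark.app.name", "data_katalyst_job")  # placeholder value
    .config("spark.executor.memory", "4g")          # placeholder value
    .config("spark.driver.memory", "2g")            # placeholder value
    .getOrCreate()
)
            </pre>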
        </div>
        <div id="datanode" style="padding: 15px;">
            <h3>2. Data Node</h3>
            <p>A Data Node holds the data structure of a table and can be categorized as Input, Output, Reference, Logical DataNode, or Intermediary against a connection name defined in the Connection settings.</p>
            <p><b>Subject Area: </b>
                Free text field to specify a name and group identical data nodes.</p>
            <p><b>Name: </b>
                Indicates where the data is read from: the table name if the connection is to a database, or the file name (including the extension) if the connection is to a source such as Local Store, HDFS, or Amazon S3.</p>
            <p><b>Source Category: </b>
                Indicates the type of data node that stores the structure of the columns:</p>
            <p>Input: Treated as a source from which the data will be read.</p>
            <p>Output: Used in a transformation to store the target data.</p>
            <p>Reference: Indicates tables that can be used as references, such as country codes, currency codes, etc.</p>
            <p>Intermediary: Typically used to store transformation data between Input and Output.</p>
            <p>Logical DataNode: Used to create a view-like data node, for example by joining two data nodes into a data node in itself.</p>
            <p><b>Data Node Alias: </b>
                An alternate name for the Data Node, used throughout the application.</p>
            <p><b>Connection Name: </b>
                A list of names from the Connections configuration, where the base path is set, allowing data retrieval.</p>
            <p>For each Data Node, users can specify a schema and parameters:</p>
            <p><b>Schema: </b>
                Defines the properties of a data node (see the example after this list).</p>
            <ul>
                <li><i>Column Name: </i>Indicates the name of a column in the data node.</li>
                <li><i>Roll Up: </i>Specifies the type of aggregation function to be performed when a GroupBy operation is applied to a column.</li>
                <li><i>Foreign Key: </i>Specifies a foreign key relationship in the format ‘datanode.columnname’ if a column of the selected data node is a foreign key in another data node.</li>
                <li><i>Mandatory: </i>Marks a column as mandatory if it should not be null.</li>
                <li><i>Primary Key: </i>Specifies the primary key of the data node in the format ‘{Key}’.</li>
                <li><i>Alternate Key: </i>Specifies an alternate primary key for the data node in the format ‘{Key1, Key2}’.</li>
            </ul>
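            <p>A hypothetical schema entry for an ‘orders’ data node using the formats described above (all names and values are illustrative only):</p>
            <pre>
Column Name:   customer_id
Roll Up:       sum                          # aggregation applied on GroupBy
Foreign Key:   customer.customer_id         # format: datanode.columnname
Mandatory:     Yes
Primary Key:   {order_id}                   # format: {Key}
Alternate Key: {order_date, customer_id}    # format: {Key1, Key2}
            </pre>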
            <p><b>Parameters: </b>
                Values specific to a data node, such as business date, weightage, etc., can be specified here.</p>

        </div>
        <div id="flows" style="padding: 15px;">
            <h3>3. Flows</h3>
            <p>Flows define how data is transferred from one point to another, based on the filters and aggregation functions that may be applied to the data node.</p>
            <p><b>Flow Type: </b>
                Selection field to specify whether the flow contains aggregated or non-aggregated columns.</p>
            <p><b>Flow Tag: </b>
                Free text field to specify a name and group identical flows.</p>
            <p><b>Source: </b>
                Indicates the data node on which the filters and aggregations are to be performed.</p>
            <p><b>Target: </b>
                Indicates the data node where the data will be placed post transformation.</p>
            <p><b>Aggregate Columns: </b>
                Specifies the columns on which the GroupBy function is to be applied.</p>
            <p><b>Pre Aggregate Condition: </b>
                Specifies the filter to be applied to the source data node before the aggregation function.</p>
            <p><b>Post Aggregate Condition: </b>
                Specifies the filter to be applied to the target data node after the aggregation function.</p>
            <p><b>Write Mode: </b>
                Option to select whether the output data is appended to the same file or overrides the existing file.</p>
            <p><b>Column Mapping: </b>
                Option to map the columns of the source to the corresponding columns of the target.</p>
            <p><b>Target Column Name: </b>
                List of columns available in the target data node.</p>
            <p><b>Target Column Expression: </b>
                The aggregation expression to be performed on the source column, e.g. sum('Col1').
                Users can also copy a column from the source to the target as-is, without adding any expression, e.g. col('Col1').</p>
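            <p>Conceptually, a configured flow of the aggregate type translates into a Spark transformation along these lines; the paths, data node, and column names below are hypothetical, not actual generated output:</p>
            <pre>
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, sum as sum_

spark = SparkSession.builder.getOrCreate()

# Hypothetical source; the real path comes from the Connection settings.
source = spark.read.csv("/data/input/sales.csv", header=True, inferSchema=True)

target = (
    source
    .filter(col("status") == "ACTIVE")           # Pre Aggregate Condition
    .groupBy("region")                           # Aggregate Columns
    .agg(sum_(col("Col1")).alias("Col1_total"))  # Target Column Expression: sum('Col1')
    .filter(col("Col1_total") > 0)               # Post Aggregate Condition
)

target.write.mode("overwrite").csv("/data/output/sales_by_region")  # Write Mode
            </pre>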
        </div>
        <div id="common-validation" style="padding: 15px;">
            <h3>4. DQ Validations</h3>
            <p>List of steps involved in the Data Quality process to verify and validate the data coming from the
                source table. Below is the list of operations available to perform a Data Quality check.
            </p>
            <ul>
                <li>Common Validation</li>
                <li>Business Validation</li>
                <li>Data Recon</li>
                <li>Data Profile</li>
            </ul>
            <p><b>Common Validation: </b>
                A library of validations, universal across the data collection, used to verify that a column’s data is
                in an expected format. The system has a predefined set of validations that can be applied to a column
                of a data node, e.g. Mobile Number, Email, Country Code.</p>
            <p>The following details are required to create a user-defined validation.</p>
            <p><i> Validation Name: </i>
                Indicates the name of the validation being created.</p>
            <p><i> Condition: </i>
                Specifies the condition according to the data pattern expected.
                Multiple validations can also be combined using AND / OR operators. Click Add to specify the
                operators and Save to save the validation.
            </p>
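            <p>As a sketch, assuming the generated checks run as PySpark column expressions (the tool’s actual condition grammar is not documented here), an Email validation combined with a non-null check could look like:</p>
            <pre>
from pyspark.sql.functions import col

# Hypothetical condition: email must be present AND match a basic pattern.
condition = col("email").isNotNull() & col("email").rlike(r"^[\w.+-]+@[\w-]+\.[A-Za-z]{2,}$")
            </pre>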
            <p><b>Business Validation: </b>
                Verifies whether the data is in an expected format catering to business needs.
                The following details are required to add a business validation.
            </p>
            <p><i> Validation Name: </i>
                Free text field to define a name for the validation being created.</p>
            <p><i> Table Alias: </i>
                Select the data node from the dropdown to which the business validation is to be applied.</p>
            <p><i> Column Name: </i>
                List of columns available in the selected data node to which the validation will be applied.</p>
            <p><i> Validation Types: </i>
                Select the type of validation to be performed on the column chosen in the step above.</p>
            <p><i> Execution Tag: </i>
                Select a tag from the list to classify the data quality category for the validation being created,
                which will be reflected in the dashboards.</p>
            <p><i> Failure Condition: </i>
                Defines a condition for a column; if the data fails to meet the specified validation criteria,
                the system logs the failure. This option is available when the Validation Type is Business Value.</p>
            <p>The following fields are required to check the Consistency of a data node:</p>
            <p><i> Source Key: </i>
                Specifies a key whose values will be used as a foreign key in the reference table.</p>
            <p><i> Reference Table Alias: </i>
                Select the data node to be referenced to check the consistency of a column.</p>
            <p><i> Reference Column: </i>
                Specifies the column from the reference table to be verified against the Column Name specified in the previous step.</p>
            <p><i> Foreign Key: </i>
                The Foreign Key in Consistency ensures that all foreign key values in a data node match the Source Key values in the
                other data node.</p>
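            <p>Conceptually, such a consistency check is a referential-integrity test: every foreign key value in the source must exist in the reference data node. A minimal PySpark sketch, with hypothetical data node and column names:</p>
            <pre>
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Hypothetical data nodes; in practice these are read via the configured connections.
orders = spark.createDataFrame([(1, "C1"), (2, "C7")], ["order_id", "customer_id"])
customer = spark.createDataFrame([("C1",), ("C2",)], ["customer_id"])

# Rows whose customer_id has no match in the reference data node fail the check.
orphans = orders.join(customer, on="customer_id", how="left_anti")
print(orphans.count())  # -> 1 (the order pointing at C7)
            </pre>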
            <p><b> Data Recon</b></p>
            <p>Verifies whether the correct set of data is flowing from the source to the destination data node.</p>

            <p><i> Flow Number: </i>
                Select the flow created in the Flows section.</p>
            <p><i> Source Aggregated Column: </i>
                Select the column from the source data node on which the ‘GroupBy’ aggregation is to be performed.</p>
            <p><i> Target Aggregated Column: </i>
                Select the column from the target data node on which the ‘GroupBy’ aggregation is to be performed.</p>
            <p><i> Target Filter: </i>
                Specifies the filter to be applied to the target data node after the data is moved from the source.</p>
            <p><i> Source Column Name: </i>
                If a column name differs between source and target, select the column from the source data node that has to be mapped to the target data node.</p>
            <p><i> Target Column Name: </i>
                If a column name differs between source and target, select the column from the target data node that has to be mapped to the source column selected
                above.</p>
            <p><i> Recon Measures: </i>
                List of aggregation functions to be performed on the source column.</p>
            <p><i> Recon Tag: </i>
                Free text field to group a set of identical data flows.</p>
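            <p>In essence, a recon computes the same aggregate on both ends of a flow and compares the results. A PySpark sketch of the idea, with hypothetical data and column names:</p>
            <pre>
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

# Hypothetical source and target data nodes for one flow.
source_df = spark.createDataFrame([("EU", 10.0), ("US", 5.0)], ["region", "amount"])
target_df = spark.createDataFrame([("EU", 10.0), ("US", 4.0)], ["region", "amount"])

# Recon Measure 'sum' computed on both ends of the flow.
src = source_df.groupBy("region").agg(F.sum("amount").alias("src_total"))
tgt = target_df.groupBy("region").agg(F.sum("amount").alias("tgt_total"))

# Regions where the totals disagree indicate data lost or altered in flight.
mismatches = src.join(tgt, on="region", how="full_outer") \
                .where(~F.col("src_total").eqNullSafe(F.col("tgt_total")))
mismatches.show()  # -> the US row
            </pre>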
            <p><b> Data Profile</b></p>
            <p>The process of examining, analyzing, and creating useful summaries of data. It yields a
                high-level overview which aids in the discovery of data quality issues, risks, and overall trends.
            </p>
            <p><i> Data Node Alias: </i>
                Select the data node from the list on which the profiling is to be done.</p>

            <p><i> Profile Tag: </i>
                Free text field to group a set of identical data profiles.</p>

            <p><i> Group By: </i>
                Select a column, or a set of columns, from the list on which GroupBy is to be applied.</p>

            <p><i> Column Name: </i>
                Choose the column from the dropdown on which the profiling is to be done.</p>

            <p><i> Profile Measures: </i>
                Select all, or any, of the profile measures based on the type of analysis to be performed, helping the
                organization in decision making.</p>
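            <p>For intuition, the kind of summary a profile produces can be approximated in PySpark with grouped aggregates; the measures shown are illustrative, not the tool’s full list:</p>
            <pre>
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([("EU", 10.0), ("EU", 30.0), ("US", 5.0)], ["region", "amount"])

# Profile 'amount' grouped by 'region' with a few common measures.
profile = df.groupBy("region").agg(
    F.count("amount").alias("count"),
    F.min("amount").alias("min"),
    F.max("amount").alias("max"),
    F.avg("amount").alias("avg"),
    F.countDistinct("amount").alias("distinct"),
)
profile.show()
            </pre>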
        </div>
        <div id="data-orchestration" style="padding: 15px;">
            <h3>5. Data Orchestration</h3>
            <p>Data Orchestration is the process of automating, managing, and coordinating the movement and transformation of data across various systems and applications. It ensures that data flows seamlessly from source to destination, enabling efficient data integration and analysis.
                Data Katalyst Orchestration is built on Airflow, which provides the infrastructure and capabilities to design, deploy, and manage complex data workflows, enabling organizations to derive insights and make data-driven decisions more effectively.
            </p>
            <p>The operations which can be used to create a DAG are as follows:</p>
            <ul>
                <li>DQ Steps</li>
                <li>Flows</li>
                <li>Rule Engine</li>
            </ul>

            <p><b> DQ Steps</b></p>
            <p>Operation: Configure the DQ operation to be performed from the dropdown.</p>
            <p>DataNodes: A list of data nodes from which the user selects the ones on which to test the DQ.</p>
            <p>Source Prefix: The prefix to be added to the generated code filename.</p>
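            <p>Since the orchestration layer is Airflow, each configured operation ultimately becomes a task in a generated DAG. Below is a minimal Airflow 2-style sketch of what such a DAG might look like; the DAG id, task names, callables, and schedule are hypothetical, not the tool’s actual output:</p>
            <pre>
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator


def run_dq_steps():
    pass  # placeholder for the generated DQ validation code


def run_flow():
    pass  # placeholder for the generated flow/transformation code


with DAG(
    dag_id="data_katalyst_example",  # hypothetical DAG id
    start_date=datetime(2024, 1, 1),
    schedule="@daily",
    catchup=False,
):
    dq = PythonOperator(task_id="dq_steps", python_callable=run_dq_steps)
    flow = PythonOperator(task_id="flows", python_callable=run_flow)

    dq >> flow  # run the data quality checks before the flow
            </pre>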
        </div>
        <div id="dashboard" style="padding: 15px;">
            <h3>6. Output & Reports</h3>

            <h5>Overview Section</h5>

            <p>Summary Metrics: High-level metrics such as the overall data quality
                score, the number of records processed, and the percentage of data passing validation checks.</p>
            <p>Trends: Visualizations showing how data quality metrics have changed over time.</p>
            <p><b>DQ Score: </b>
                An overview of the total rows passed across the data nodes within a project, in percentage format.
                The score does not double-count duplicate failures; for example, a row failing both schema and business validation is counted as a single failure.</p>
            <p>The score varies according to the weightage allocated at each Data Node level or at the Parameters level. If no weightage is defined,
                100% is taken by default to calculate the score.
                The six data quality dimensions are Accuracy, Completeness, Consistency, Uniqueness, Timeliness, and Validity.
            </p>
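            <p>A small worked example of how such a weighted score could be computed; the weighting formula below is an assumption based on the description above, not a confirmed specification:</p>
            <pre>
# Hypothetical per-data-node pass rates and weightages.
nodes = [
    {"name": "orders",   "passed": 950, "total": 1000, "weight": 0.7},
    {"name": "customer", "passed": 180, "total": 200,  "weight": 0.3},
]

# Weighted average of pass rates; with no weightage defined, each node
# would default to full (100%) weighting per the description above.
score = sum(n["passed"] / n["total"] * n["weight"] for n in nodes) * 100
print(f"DQ Score: {score:.1f}%")  # -> DQ Score: 93.5%
            </pre>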
            <h5>Data Quality Dimensions</h5>
            <p><b>Completeness: </b>
                Requires that all the required records and values are available, with no missing information; it measures what information is missing.
                For example, consider the address field on a membership form: if three forms out of 100 are missing an address, the address data is 97% complete.
                This is measured based on the primary key and business validation configurations.
            </p>
            <p><b>Validity: </b>
                Measures the data failing a set of patterns that match real-world entities,
                such as phone number, currency, and email ID patterns. This is achieved through the patterns defined in
                Common Validation.</p>
            <p><b>Uniqueness: </b>
                Designed to prevent the same data from being stored multiple times. When data is unique,
                no record exists more than once within a table: each record can be uniquely identified, with no redundant
                storage. This is achieved through the primary key and alternate key configurations set in the data node schema, as sketched below.</p>
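            <p>A PySpark sketch of a uniqueness check over a primary key (the data node and key names are hypothetical):</p>
            <pre>
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()
orders = spark.createDataFrame([(1, "a"), (2, "b"), (2, "c")], ["order_id", "item"])

# Any primary key value occurring more than once violates Uniqueness.
duplicates = orders.groupBy("order_id").count().where(F.col("count") > 1)
duplicates.show()  # -> order_id 2 appears twice
            </pre>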
            <p><b>Timeliness: </b>
                Measures the SLA (Service-Level Agreement) of the code execution against the SLA configured in Parameters. It is a boolean
                value: if the value is Yes, the execution finished within the SLA; otherwise it exceeded the time configured in the system.</p>
            <p><b>Consistency: </b>
                Consistency means the data across all systems reflects the same information and is in sync across the enterprise.
                It is measured by comparing the data between data nodes against the foreign key and primary key combination configured in business validation.</p>
            <p><b>Accuracy: </b>
                Measures the foreign key relation configured in the schema set at the Data Node.</p>
        </div>
    </div>
</div>

{% endblock %}