· 7 years ago · Nov 06, 2018, 07:10 PM
1Question Answer
2Top 5 books on Bigdata in your opinion?
3Top certifications in your opinion.
4Top online classes on Bigdata in your opinion.
5Difference between Hortonworks and Cloudera. No ORC file support in Cloudera Hive (no transactions, updates, deletes, no Streaming API)
6Difference btw Hadoop and Hadoop 2. New cluster manager YARN
7Explain Parquet format and it’s adv. disadv.
8Explain “partition toleranceâ€. Network partitioning (2 nodes cannot communicate).
9Consistency/Availability/Partitioning tolerance.
10CA/CP/AP.
11Explain “hotspottingâ€.
12Explain “saltingâ€.
13What is YARN? Cluster job scheduler
14What is NameNode? Node containig metadata of all files on HDFS
15Explain MapReduce
16It's a programming model that allows you to express algos for execution in distributed system.
17
18It's used in cooperation with distributed file system.
19
20How to avoid DataNode queries? Use HDFS short circuit reads (read HDFS file directly/locally without DataNode call).
21Largest cluster you worked with?
22Lardest dataset you worked with?
23Data type you worked with (structured, semi-s, unstructured)
24Give example of troubleshooting.
25Give example of performance improvement.
26How do you interface with Hadoop from command line? java
27Which protocol is used to connect to Hadoop? thrift
28What's the name of the script that starts/stops Hadoop daemons? start-all.sh stop_all.sh
29How do you access HDFS?
30How can one improve HDFS performance? Mount options for HDFS disks, HDFS block size, short circuit reads, avoid files smaller that block size. Tune DataNode JVM for optimal performance.
31Explain Traits in Scala.
32Explain case classes in Scala.
33Is case-to-case inheritance possible in Scala?
34What is functional language?
35How do you compile Scala code?
36How do you improve compilation speed of Scala code?
37What is companion object in Scala?
38Explain how Scala code is compiled and executed. Compiled by scala compiler. Executed by JVM (same VM that executes Java binary code)
39Is Scala parsed into Java code and then compiled and executed on JVM? No. Scala is compiled into bytecode directly and then executed in JVM.
40Is Scala faster than Java? No
41What advantages of Scala over Java?
42Is JVM exclusively used for executing Java bytecode? No. Scala, Closure + more languages.
43What features Scala has which does not exists in Java? List few. functional programming. Traits, case classes, companion objects.
44What are major advantages of Scala over Java? Consise, functional landuage features, immutable types, concurrent programming, parsers.
45What are Scala disadvantages? Requires JVM, bound by what JVM can do. Cryptic, hard to understand. "_" underscore letter can mean 15 different things depending on context. Very slow to compile.
46Is Hbase column or row oriented? column
47Explain Hbase table format. System tables vs user tables. Column families (CFs), column qualifiers (CQs). Column+key==cell. Multiple versions of the same cell by timestamp. Cell==KeyValue pair. Row formed of group of cells.
48How Hbase orders keys? Based on a byte values.
49What is a BLOOM FILTER? BF will tell Hbase if given key might be or is not in a given file.
50Explain Hbase compaction/spit/balancing. Compacts many small files into bigger one. Splits - opposite to compaction. Every 5 mins balancer is executed to ensure RegionServers are manging similar number of regions.
51Is HBase ACID compliant? yes
52Example of HBase managing tools? Cloudera manager. Apache Ambari. Hannibal.
53Example of tools brining SQL functionality to Hadoop/Hbase? Apache Phoenix, Trafodion, Splice Machine.
54What are considerations when you do sizing and schema design of HBase cluster? Ingres/egress workload, SLAs, capacity
55How do u tune Hbase? Regions per node. Raw space per node. Number of WALs to keep. Percentage of heap devoted to memstore and block cache.
56Usecase: HBase as sytem of recods. Explain implementation 1. Flat files->HDFS->Avro
572. ETL based on joins on primary key. All transforms are handled using MapReduce.
583. Use Pentaho to avoid manual MapReduce coding and use GUI instead.
594. Avro files loaded into temp staging HBase tables.
605. Using bulk load tool completebulkload load HFiles into staging table.
615.Once loaded to staging data is pushed to serving engine (Solr or Impala)
626.Avro->Parquet->Impala
63
64In this usecase:
65a. Solr is used for natural language searches.
66b. HBase - full detail records
67c. Impala - DW, aggregations, SQL
68What languages do you use to interface with Spark? Java, Scala, PySpark
69What's the name for bulk loading tool in Hbase? Explain input output. completebulkload. Takes URL as location of a file on HDFS, then loads each file into relevant region. Splits Hfile according to region boundary.
70Desribe Hive 1. Opensource distributed framework in Hadoop.
712. Provides SQL-like interface to files on HDFS.
723. Run mapReduce for each SQL.
734. Developed by Facebook.
74What is Hive Metastore? Hive table and db definitions are stored in Metastore.
75Consists of 2 main components - services and database for metadata.
76Provides API to query database, tables, schema.
77How do you access Hive? Hive CLI, Beeline
78What is ZooKeeper? Open source centralized service for providing coordination between distributed applications.
79How hive solves high availability? By using ZooKeeper. Hive stores configuration in it and it provides high availability of HiveServer2
80Explain how ZooKeeper helps Hive high availability? If multiple HiveServer2 instances are registered with AooKeeper and all of the fail but one - ZooKeeper passes URL to that instance to client to it can execute queries.
81ZK will not restart failed instances.
82How SQL clients connect to Hive? Using HiveQL view JDBC/ODBC
83How do you connect to hive from Python? pyHS2 module
84What is a backend of Hive Metastore? Derby/MySQL/Oracle
85Does Hive support external tables? Yes
86Partitioning in Hive? yes
87UDFs in Hive? yes, Python, Java
88List complex data types in Hive Struct, Map, Array
89How do you list partitions in Hive managed table? SHOW PARTITIONS <tablename>
90How to avoid full table scal in Hive partitioned managed table? set hive.mapred.mode=strict
91(will require partition predicate in each SQL)
92How to add new partition in Hive table? ALTER TABLE ADD PARTITION
93How to rename Hive table partition? ALTER TABLE PARTITION RENAME TO PARTITION
94How do you exchange Hive table partitions? ALTER TABLE EXCHANGE PARTITION WITH TABLE
95Can I load data into specific Hive managed table partition? yes
96Give me example of CTAS statement in Hive? INSERT OVERWRITE TABLE … PARTITION … FROM (truncates and inserts)
97INSERT INTO TABLE PARTITION .. FROM (appends)
98Types of partitioning in Hive tables? Static (based on column value), dynamic (cased on column name in SQL)
99What's the difference between managed and external table partition delete? External partion data files are preserved. (not deleted phisically)
100Disadvantage of partitioning? SQL does not perform well when number of partitions is high.
101Explain Bucketing Essentially hash partitioning. Fixed number of buckets for the whole data.
102Bucketing limitations Cannot execute LOAD DATA, but use INSERT instead (from existing tables)
103How to drop Hive database? DROP DATABASE dbname CASCADE;
104How to switch to different schema/db? USE (DATABASE/SCHEMA) dbname;
105How to list available dbs / schemas in Hive? SHOW (DATABASE/SCHEMAS) [LIKE wildcards;]
106Explain TEMPORARY tables in Hive They exists only for the length of the session. You cannot create partition or index on temp table
107Explain 'SKEWED BY' in Hive table DDL. Heavy skew values are split into separate files.
108Explain 'CLUSTERED BY' in Hive table DDL. Used to define bucketing for table or partition
109What is hotspotting and how to avoid it in Hbase? use hashing keys
110What's the disadvantage of using Hashing Keys in Hbase? Data is losing original ordering so you cannot rely on initial order of the key in your queries.
111Give example of Hbase table parameters COMPRESSION, DATA BLOCK ENCODING, BLOOM FILTER, PRESPLITTING
112Give example of different computing environments for Hadoop? Ground - inhouse, on-premise cluster.
113Cloud - AWS EMR. Cluster in VPC/private subnet in Amazon cloud service.
114What is Hbase? Non-relational database that allows low latency quick lookup jobs in Hadoop.
115It adds trarnsactional capabilities to Hadoop, allowing DMLs.
116Is data bompressed by default in Hbase? no
117Which compression algos are supported by Hbase? LZO, GZ, SNAPPY, LZ4
118How do you improve RDD serialization in Spark?
119What's the fastest compression algo used in Hbasse? SNAPPY
120What is FAST_DIFF? It's data block encoding option which will store only difference between current key and previous.
121What is PRESPLITTING? Presplitting a table means asking Hbase to split the table into multiple regions when it's created.
122How would you convert CSV files to Avro(HFiles)? Table table = connection.getTable(tableName);
123Job job = Job.getInstance(conf, "ConvertToHFiles: Convert CSV to HFiles");
124HFileOutputFormat2.configureIncrementalLoad(job, table, connection.getRegionLocator(tableName));
125job.setInputFormatClass(TextInputFormat.class); job.setJarByClass(ConvertToHFiles.class);
126job.setJar("/home/cloudera/ahae/target/ahae.jar");
127job.setMapperClass(ConvertToHFilesMapper.class);
128job.setMapOutputKeyClass(ImmutableBytesWritable.class);
129job.setMapOutputValueClass(KeyValue.class); FileInputFormat.setInputPaths(job, inputPath);
130HFileOutputFormat2.setOutputPath(job, new Path(outputPath));
131public static final ByteArrayOutputStream out = new ByteArrayOutputStream();
132public static final DatumWriter<Event> writer = new SpecificDatumWriter<Event> (Event.getClassSchema());
133public static final BinaryEncoder encoder = encoderFactory.binaryEncoder(out,null);
134public static final Event event = new Event();
135public static final ImmutableBytesWritable rowKey = new ImmutableBytesWritable();
136// Extract the different fields from the received line.
137String[] line = value.toString().split(","); 1
138
139event.setId(line[0]);
140event.setEventId(line[1]);
141event.setDocType(line[2]);
142event.setPartName(line[3]);
143 event.setPartNumber(line[4]);
144 event.setVersion(Long.parseLong(line[5]));
145event.setPayload(line[6]); 2
146
147// Serialize the AVRO object into a ByteArray
148out.reset(); 3
149writer.write(event, encoder); 4
150encoder.flush();
151
152byte[] rowKeyBytes = DigestUtils.md5(line[0]);
153rowKey.set(rowKeyBytes); 5
154 context.getCounter("Convert", line[2]).increment(1);
155
156KeyValue kv = new KeyValue(rowKeyBytes,
157CF,
158 Bytes.toBytes(line[1]),
159 out.toByteArray()); 6
160context.write (rowKey, kv);
161What happens to KeyValueSortReducer when number of columns in a row is too high? it runs out of memory and errors out (OOM). HBASE-13897. HBASE-14339. HBASE-14150
162What are common causes for MapReduce job to fail? Input file does not exists,.. Etc
163Which tool is used to extract metadata from HFiles? HFilePrettyPrinter
164How do you count rows in Hbase table using Hbase API? hbase org.apache.hadoop.hbase.mapreduce.RowCounter tablename
165What is CellCounter? MapReduce tool used to count number of rows/columns/versions in Hbase table.
166Limitations: on number of unique keys. HBASE-15773
167How do u read Avro object from Hbase? try (Connection connection = ConnectionFactory.createConnection(config); Table sensorsTable = connection.getTable(sensorsTableName)) { Scan scan = new Scan (); scan.setCaching(1); ResultScanner scanner = sensorsTable.getScanner(scan); Result result = scanner.next(); if (result != null && !result.isEmpty()) { Event event = new Util().cellToEvent(result.listCells().get(0), null); LOG.info("Retrived AVRO content: " + event.toString()); } else { LOG.error("Impossible to find requested cell"); } }
168How do you index Hbase Avro table to Solr using MapReduce? scan.setCaching(500); scan.setCacheBlocks(false); TableMapReduceUtil.initTableMapperJob( options.inputTable, // Input HBase table name scan, // Scan instance to control what to index HBaseAvroToSOLRMapper.class, // Mapper to parse cells content Text.class, // Mapper output key SolrInputDocumentWritable.class, // Mapper output value job); FileOutputFormat.setOutputPath(job, outputReduceDir); job.setJobName(getClass().getName() + "/" + Utils.getShortClassName(HBaseAvroToSOLRMapper.class)); job.setReducerClass(SolrReducer.class); job.setPartitionerClass(SolrCloudPartitioner.class); job.getConfiguration().set(SolrCloudPartitioner.ZKHOST, options.zkHost); job.getConfiguration().set(SolrCloudPartitioner.COLLECTION, options.collection); job.getConfiguration().setInt(SolrCloudPartitioner.SHARDS, options.shards); job.setOutputFormatClass(SolrOutputFormat.class); SolrOutputFormat.setupSolrHomeCache(options.solrHomeDir, job); job.setOutputKeyClass(Text.class); job.setOutputValueClass(SolrInputDocumentWritable.class); job.setSpeculativeExecution(false);
169How do you retrieve Avro data from Hbase based on Solr? CloudSolrServer solr = new CloudSolrServer("localhost:2181/solr"); solr.setDefaultCollection("Ch09-Collection"); solr.connect(); ModifiableSolrParams params = new ModifiableSolrParams(); params.set("qt", "/select"); params.set("q", "docType:ALERT AND partName:NE-555"); QueryResponse response = solr.query(params); SolrDocumentList docs = response.getResults(); LOG.info("Found " + docs.getNumFound() + " matching documents."); if (docs.getNumFound() == 0) return; byte[] firstRowKey = (byte[]) docs.get(0).getFieldValue("rowkey"); LOG.info("First document rowkey is " + Bytes.toStringBinary(firstRowKey)); // Retrieve and print the first 10 columns of the first returned document Configuration config = HBaseConfiguration.create(); try (Connection connection = ConnectionFactory.createConnection(config); Admin admin = connection.getAdmin(); Table sensorsTable = connection.getTable(sensorsTableName)) { Get get = new Get(firstRowKey); Result result = sensorsTable.get(get); Event event = null; if (result != null && !result.isEmpty()) { for (int index = 0; index < 10; index++) { // Print first 10 columns if (!result.advance()) break; // There are no more columns and we have not reached 10 event = new Util().cellToEvent(result.current(), event); LOG.info("Retrieved AVRO content: " + event.toString()); } } else { LOG.error("Impossible to find requested cell"); } }
170USECASE: Near Real-Time Event Processing SLA-15 seconds to be available for lookup and reference from the customer
17199.9% uptime
17230,000 event/sec
17310 requests/sec from Solr
174Kafka maintains different streams of messages referred as topics.
175Storm/Bolt is used for message enrichment
176Lily Indexer is used to write new records and updates into Solr
177It works as replication sink for HBase and Solr
178Spark is used as secondary index because HBase can only query the primary row key.
179Spark also contains dimensions.
180USECASE: Near realtime message enrichment using Storm. Storm/Bolt is used for message enrichment
181USECASE: Persist topic messages in HBase. 1. Flume picks up data from Kafka queue.
1822. Flume processes data if required
1833.Flume send data ti Hbase for storage
184USECASE: Populate and test Kafka test queue. 1. Write java program populating random XML messages.
1852. Use Kafka CLI (command line) to load them to Kafka.
1863. Configure Flume agent.
1874. Enghance Flume agent with interceptor to perform XML to Avro conversion.
188Explain most common usecases for Kafka. Kafka is used as :
1891. Queue of all the data received from external data sources and make it available for downstream consumers.
1902. Used to feed data to Flume agent channel.
191Explain most common uses of Flume. Streams data from source to a sink.
192Applies small modifications to a message if needed.
193It stores data in a channel until it's sent to destination
194How to improve distribution of keys in HBase? Even unique keys do not guarantee even distribution across regions in Hbase.
195Hashinf a key allows better distibution.
196Data collision of MD5 is low but possible this is why it's better to append HASH+KEY.
197Only 2 bytes of hash is required for better distribution.
198Alternative hash algo - CRC32
199What are alternatives to MD5 hashing algo? CRC32
200Explain Lily The goal of Lily indexer is to replicate into Solr all the trades persisted into Hbase.
201It's built on top of Hbase replication framework (this way all items are guaranteed to be forwarded to Solr).
202Explain Morphlines Morphlines is an open source framework that reduces the time and skills necessary to build and change Hadoop ETL stream processing applications that extract, transform and load data into Apache Solr, Enterprise Data Warehouses, HDFS, HBase or Analytic Online Dashboards.
203USECASE: How to one updates and inserts same Hbase row in one Flume event? Flume does not allow 2 events from one interceptor (one to update a trade and one to insert).
204On serializer side it's possible to generate 2 Hbase puts on the same row.
205Add all additional information into Flume Event header so it can be passed to serializer.
206Flume serialiser will generate 2 puts and update old trade.
207Flume Interceptor has to finish very quickly - otherwise it will overwhelm channel queue.
208Flume serializer is a class that's given a Flume event - it receives Parquet object from Inteceptor and even header and transfrorms it into HBase put operation.
209How do you avoid hotspotting RegionServer in HBase? Presplit tables into multiple regions.
210USECASE: Hbase as Master Data Management tool. System of record used as golden source of data (MDM).
211Records stored in Hbase will be used to rebuild any Hive or external data sources in the event of failure.
212New Hive partition is created on top of newly loaded data and then linked to existing archive table.
213List Hbase pros and cons MapReduce vs Spark. No Scala for MR.
214When we run MR job over Hbase table each mapper processes one region.
215You can use Puts and Increments to store data in Hbase from Spark.
216USECASE: Use Spark and flat file to enrich data in Hbase. In Spark create one mutation per row of the file and buffer that to be sent to Hbase table.
217Use Hbase Spark BulkPut framework to persist it in Hbase table.
218USECASE: Using Spark create Hfiles for Bulk load to Hbase Each Hfile will have to belong to one of the regions and will have to contain keys only within those region boundaries.
219There's no Java API to perform bulk loads into Hbase. JIRA HBASE-14217.
220Use Spark Hbase API to generate Hfiles.
221After Hfiles are generated use LoadIncrementalHFiles API to load files into table
222USECASE: Create Spark job to run over Hbase table and aggregate data. Spark will process different regions in parallel.
2231. Initialize scan.
2242. Create scan filer for key/value.
2253. Get RDD representation of Hbase table.
2264. Perform aggregation on Spark RDD.
2275. For row count - loop via all RDD partitions and get aggregated row count.
2286. You can run over data using multiple executors - one per HBase region, then process aggregations with even more executores - one per RDD partition.
229What are typical uses of Spark? Micro-batch processing engine.
230Explain Structured Streamin in Spark You can take same operations you perform in batch mode and run them in streamin mode.
231What are usecases of Structured Streaming? Write prototype as regular batch job and then release it in a streaming mode.
232Retail data used by Streaming Job.
233USECASE: Explain how retail data can be used by Streaming Job? 1. Create DataFrame, load data (or read stream), define number of shuffle partitions (default is 200).
2342. Add total cost column (to see of what days customer spent the most).
2353.Create window function oved time-series column.
236Explain Spard RDDs Relient Distributed Datasets.
237Virtualy everything in Spark is built on top of RDDs.
238They are lower level than DataFrames.
239They are partitioned.
240Use them for raw unprocessed unstructured data.
241You should not use RDDs directly unless you are maintaining old Spark code.
242Use onlt Structured API in modern Spark apps (do not use RDDs directly)
243List collections you can create using Spark Structured API Datasets.
244DataFrames.
245SQL Tables and views.
246Applied both to streaming and batch computation.
247Easy to migrate from batch to streaming and v.v.
248Explain Spark data flow. 1. Spark is a distributed programming model in which the user specifies transformations.
2492. Multiple transformationsbuild up a directed acyclic graph of instructions.
2503. An action begins the process of executing the graph of instructions as a single job by breaking it down to into stages and tasks to execute across the cluster.
2514. The logical structures that we manipulate with transformations and actions are DataFrames and Datasets.
2525. To create new DataFrame or Dataset you call transformation.
253To start computation or convert to native language types you call an action.
254Explain Spark's structured collections. Theare ate 2 types: DataFrames and Datasets as of Spark 2.2.
255They are distributed table-like collections with weel defined columns and rows.
256Each column must have same number of rows (or Null).
257Each column has type.
258SDs represent immutable, laizily evaluated (executed only upon action) plans (sequence of functions) that specify what operation to apply to data residing at a location to generate some output.
259Tables and views are the same thiong as DataFrame code.
260Explain schemas in Spark. Scheam defines the column names and types of DataFrame.
261You can define them manually or on the fly (schema on read).
262Explain Catalyst in Spark. Catalyst is internal engine that is used for type mapping from different language APIs (like Scala, Python).
263It maintains type lookup table for each language.
264Doing so itopens up variety of execution optimizations.
265This means any language call is translated to Spark types and executed "purely in Spark".
266Explain diffs between DataFrame and Dataset 1. DF is untyped, DS is typed.
2672. Types evaluated at runtype in DataFrames and at compile type in Datasets.
2683. DSs are available only in JVM based langs like Scala, Java. We specify types using case classes (Scala) and Java beans (Java).
2694. To Scala DFs are DSs with generic type ROW (Row type is Spark's internal representation of optimized in-memory format for computations). This internal Spark's type does not incur instantiation and garbage collection cost (comparing to JVM types).
270What's the meaging of $ sign in Scala? Allows to designate a string as a special string that can be referred in expression. Along with tick (`) It's a shorthand way to refer to a column by name.
271How to you list columns of a DataFrame? .columns
272USECASE: How to get unique, distinct values in DataFrame? df.select("ORIGIN_COUNTRY_NAME", "DEST_COUNTRY_NAME").distinct().limit(5).show()
273Disadvatage of writing Spark UDF in Python? Serializing Python is expensite.
274No memory management.
275Python competes for memory with Spark.
276How Spark handles after join duplicate column names? 1. Rename column before join.
2772. Drop column after join.
2783. Change join expression from Boolean expression to string sequence.
279Explain Spark communication strategies.
280Explain shuffle join all-to-all
281Explain broadcast join
282What are cases of shuffle join? 1. Big table -to -Big table.
283Every node talks to every other node and they share data according to which node has a certain key or set of keys (on which you are joining).
284This join is expensive if your data is not partitioned well.
2852. Big table-to-Small table.
286DataFrame is broadcasted to all worker nodes (if it fits memory).
287You can use broadcast hint (person.join(broadcast(graduateProgram), joinExpr).explain()).
288You can use hints on SQL level: MAPJOIN, BROADCAST, BROADCASTJOIN
289When would you use Datasets vs DataFrames When you need type safety and you are willing to accept perforance cost.
290For every row Spark converts the Row format into the object you specified (case class of Java class). This conversion slows down operations.
291Use it in cases which cannot be expressed using Structured API.
292Use it in case you have business logic encoded in function.
293Use it if code correctness is a priority.
294Use it when you want to reuse variety of transformations of entire rows between single-node workloads.
295What is lower level API in Spark and when do you need to use it? RDDS, SparkContext, distributed shared variables(broadcast variables and accumulators).
296Use it when you need tight control over physical data placement across the cluster.
297You need to maintain some legacy RDDs.
298You need custom shared variable manipulation.
299Explain RDD record. RDD record is Java or Python object.
300You can store anything you want in these
301Describe main RDD properties. List of partitions.
302A function for computing each split.
303A list of dependencies on other RDDs.
304Optionally, a Partitioner for key-value RDDs (to say that the RDD is hash-partitioned).
305What RDDs feature is not present in DataFrames API? Checkpointing - act of saving an RDD to disk so that future references to RDDs use those snapshots instead of recomputing RDDs from source.
306How to could RDD records using shell tools (piping). words.pipe("wc -l").collect()
307What's glom? glom is function that takes every partition in your dataset and comverts them to arrays.
308Usefult when you want to collect data on a Driver as arrays.
309What's the advantage of repartitioning in RDD? Increasing number of partitions can increase the level of parallelism when operating in map-and filter-type operations.
310Repartition will cause shuffle.
311How do you use custom partitioning. Use custom partitioned RDD by appying custom partitioner.
312Convert RDD to DataFrame or Dataset.
313How do you perform custom partitioning of RDD? You need to implement your own class that extends Partitioner.
314Spark has 2 built-in Partitioners - HashPartitioner and RangePartitioner.
315They will rowk for discrete and continuous values.
316How do you improve serialization speed in Spark? Using Kryo.
317Requires you to register your class.
318What's the purpose of accumulator variables? Add togather data from all tasks into shared result.
319For example: implement counter of parsing errors across cluster.
320What's the purpose of broadcast variables? Lets you save large immutable value on all the worker nodes and resuse it across many Spark actions without resending it to the cluster and without encapsulating that variable in a function closure.
321Spark transfers data more efficiently around the cluster using broadcasts.
322How do you update accumulator? Inside action.
323Restarted tasks will not update the value.
324Accumulator updates are not guarantted within lazy transformation like map.
325What are types of accumulators? Named and unnamed.
326Named will be seen iin Spark UI.
327How do you crate your custom accumulator? You need to subclass AccumulatorV2 class and implement it's abstract methods.
328What's the difference between Cluster manager driver and Spark Driver? CM Driver is tied to physical representation of cluster rather than processes (as they are in Spark).
329What is an alternative cluster manager except YARN? Apache Mesos
330What are Spark app execution modes? 3 modes:
331Cluster mode.
332Client mode.
333Local mode.
334What is client execution mode in Spark? Same as cluster plus Spark driver remains on the client machine that submitted the application.
335Client machine is responsible for maintaining the Spark driver process and the cluster manager maintains the executor processes.
336So this client machine is not part of the cluster and is not colocated with it.
337This type of machine called gateway machine or edge node.
338What is local execution mode in Spark? Runs entire Spark application on a single machine.
339What is stage in Spark? Stages in Spark represent groups of tasks that can be executed togather to compute the same operation on multiple machines.
340Engine starts new operation after operation called Shuffle.
341Shuffle represents physical repartitioning of data. (eg sorting or grouping by sending records from the same key to the same node).
342Twhat is task in Spark? Stages in Spark consists of Tasks.
343Task corresponds to a combination of blocks of data and a set of transformations that will run on a single executor.
344Are all data operations in Spark in-memory? No, only, the stages which Spark can pipeline will be in-memory.
345Thise have to be on the same node without shuffling.
346Others will be persisted on disk and reused.
347What is pipelining in Spark? It's in-memory computation process.
348It occurs at RDD level.
349Can shuffle be done in -memory in Spark? No, all shuffle operations will be persisted to disk.
350Partitions can be reused across multiple jobs.
351What is "data locality". It's important aspect in shared cluster environment.
352It specifies a preference for certaain nodes that hold certain data, rather than having to send those blocks to other node and perform data task there.
353If you run your storage system on the same node as Spark, and the system supports locality hints, Spark will try to schedule tasks close to data.
354How do you compute statistics? ANALYZE TABLE table_name COMPUTE STATISTICS
355It can be done only on named tables not on arbitrary DataFrames or RDDs.
356Cost based optimizer JIRA SPARK-16026
357How does Spark manages shuffle service. There's ExternalShuffleService that runs as standalone app outside Executir process.
358It serves shuffle blocks to executors at all time.
359Give me example of UDF optimization. There's ongoing work to make data available to Python UDFs in batches.
360Vectorized UDF extension for Python that givess your code multiple records at once using Pandas data frame.
361What's the difference in caching for RDD and Structures API? RDD will cache physical data (bits).
362IN Structured API caching is done using physical plan.
363So we store plan, not data.
364This may cause you to access other's cached data.
365What are different cache storage levels? MEMORY_ONLY -deserialized RDD Java objects in JVM memory.
366MEMORY_AND_DISK - partitions that do not fit JVM memory will be stored on disk.
367MEMORY_ONLY_SER - keep RDD partitions as serialised Java objects. CPU bound.
368MEMORY_AND_DISK_SER-save partitions that do noot fit JVM memory to disk (instead of recomputing them).
369DISK_ONLY - all on disk at all times.
370MEMOTY_ONLY_2+MEMORY_AND_DISK_2- replicate each partition on 2 nodes.
371OFF_HEAP - store data off heap.
372What is stream processing? It's the act of continuously incorporating new data to compute a result.
373Explain Spark streaming APIs. Dstream API - microbutch processing.
374It has declarative (function based ) API but no support for event time.
375Supports joins with static datta.
376Disadvantage: Based on pure Java/Python objects not DataFrames.
377Purely based on processing time vs event time.
378Can only operate in micro batch fashon.
379What is event time processing? It means that rather thsn processing data according to the time it reaches your system, you process it according to the time that it was generated.
380What are watermarks? It's a feature of streaming systems that asllow you to specify how late they expect to see data in event time.
381It sets the limit how long they need to remember old data.
382What are the limitations of Structured Streaming API? Users cannot sort stream that are not aggregated.
383Users cannot perform multiple levels of aggregation without using Stateful Processing.
384What is watermark? It's amount of time following a given event or set of events after which we do not expect to see any more data from that timestamp.
385How do you enable fault tolerance in Spark application? Configure your application to use checkpointing and write-ahead logs.
386Upon failure - restart pointing to checkpoint location.
387How to check current progress of a query in Spark? Run query.recentProgress
388What is MiMa? Migration Manager for Scala/Spark.
389It catches binary incompatibilities between releases.
390Which transformations are not 100% lazy? sortByKey needs to evaluate RDD to determine range of data, so it involves both a transformation and an action.
391What are advantages of lazy evaluation? It allows Spark to combine operations that do not require communication with the driver (one-on-one transformations).
392Spark can chain togather operaations with narrow dependencies.
393Spark app will fail only at a point of action.
394What is persist() function doing? Lets user control how RDD is stored.
395Stores RDDs as deserialized objects in memory.
396Evicts recently used partition (LRU caching) if space is required.
397USECASE: Kafka+Spark+Dashboards 1. Create historical archive of all events using Parquet.
3982. Perform low latency event-time aggregation and push events back to Kafka for other consumers.
3993. Perform batch reporting on the data stored in compacted topic in Kafka.
400Give example of Spark and Parquet workloads. Partition pruning.
401Column projection.
402Predicate/filter push-down.
403Investigating Parquet metadata.
404Measuring Spark metrics.
405Spark performs partition discovery with API.
406Is it possible to use ANSI SQL to query data in Spark? Yes, SIMBA ODBC driver.
407It maps SQL to Spark SQL.
408Supports SQL-92 standard.
409Supports JDBC 4.0
410Is it possible to query streamed external table in Spark? Yes.
411Set the config spark.sql.parquet.cacheMetadata to false
412refresh the table before the query: sqlContext.refreshTable("my_table")
413How do you execute SQL over DataFrame? Register DF as table.
414df.createOrReplaceTempView("src")
415Can you update Spark's table/DF/Ds/RDD? No, thise are immutable.
416You can create new column on the fly and then delete old and rename new.
417You can modify target db and then refresh from it.
418DataFrameWriter can either append or overwrite existing table.
419What are limitations of Hive Exchange Partitions? You cannot exchange already existing partition.
420(move existing anf then you can exchange).
421PE will fail if index exists.
422PE will fail for transactional tables.
423Source and target schems have to be the same.
424How do I exchange partitions in the same table in Hive? You can follow these discrete steps:
425partition rename
426data copy
427old partition drop
428table repair
429What happens under the hood during Partition Map (Exchange) in Hive? 1. Non coordinator nodes sends its local state.
4302. Coordinator merges local partition information.
4313. Coordinator sends full map to other nodes.
432Is COMMIT or ROLLBACK supported in Hive? NO, as of Hive 0.14 it's not supported.
433All language operations are atomic.
434What are requirements if you want table to support transactions in Hive? ORC file format.
435Tables must be bucketed.
436Can external tables in Hive be ACID (support transactions)? No, they are out of control of the compactor.
437What are data isolation levels in Hive? At this time only snapshot level isolation is supported.
438When a given query starts it will be provided with a consistent snapshot of the data.
439There is no support for dirty read, read committed, repeatable read, or serializable.
440With introduction of BEGIN the snapshot isolation will extend from single query to transaction.
441What are limitations of transactional tables in Hive? LOAD DATA is not supported.
442You cannot Exchange Partition.
443You can use only ORC file format.
444Explain basic transaction design in Hive. HDFS does not support inplace changes to files.
445It does not offer read consistency when writers appending to files being read by users.
446Explain how you update table in Spark. It looks like I can update Spark DataFrame only if Hive is a backend.
447On Hive side you need to have transactional table for MERGES/DMLs.
448When you update a record in a table Hive will simply create delta file.
449When you query Hive table you just updated it will reconstruct updated record from reading main file and delta files.
450You can run “compaction†process to create one file from all delta files and main file.
451What is ACID and why should u use it? ACID stands for 4 traits of database transactions:
452Atomicity - completes or fails and does not leave partial data.
453Consistency - once completed it's visible to all operations.
454Isolation - incomplete operation is not visible to other users.
455Durability - operation persists in case of machine or system failure.
456How ACID isolation is provided in Hive? Using ZooKeeper or inmemory locking.
457What is the level of ACID cemantics in hive. pre Hive 0.13 - partition level,
458post 0.13 - row level.
459Why transactions were added to Hive? To address following use cases:
4601. Streaming ingest of data.
4612. Slow changing dimensions.
4623. Data restatement.
4634. Bulk updates using SQL MERGE.
464How do you concatenate Hive table? ALTER TABLE table_name [PARTITION partition_spec] CONCATENATE
465can be used to merge small ORC files into a larger file, starting in Hive 0.14.0.
466The merge happens at the stripe level, which avoids decompressing and decoding the data.
467How do you overcome performance degradation in Hive releated to creation of delta files? you use Compactor
468What is Compactor in Hive? It's a set of background processes running inside Metastore to support ACID system.
469It consists of Initiator, Worker, Cleaner, AcidHouseKeeperService.
470Explain Delta file compaction. There are 2 types of compaction: major and minor.
471Minor - takes set of existing delta files and rewrites them to a single delta file per bucket.
472Major- creates new major file from old major file and any delta files.
473It's done in a background and do not prevent concurrent reads and writes of the data.
474After compaction system waits until all readers of the old files have finished and then removes old files.
475What's the difference between Streamin API and Streamin Mutation API? S API will only insert, but SM API can also update record.
476Baset on 2 different transaction models.
477S API: append continuous stream of data into Hive table by batching small sets of writes into multiple short-lived transactions.
478SM API: infrequently apply large sets of mutations to data set in an atomic fashon. - either all or none. This requires single long lived transaction. Mutations are performed to HDFS via ORC SPIs bypassing MetaStore. You can configure MetaStore to perform regular data compactions.
479How do you appy mutation to a record? To apply mutation we need record identifier.
480Hive internally uses RecordIdentifiers stored in a virtual ROW_ID column (it uniquely identifies record in ACID table).
481You can get rowid via AcidRecordReader.getRecordIdentifier()
482What are Data requirements of SM API? 1. Order records by ROW_ID.originalTxn, then ROW_ID.rowId.
4832. Assign a ROW_ID conaining a computed bucketid to each record to be inserted.
4843. Group by table partiton value, then ROW_ID.bucketId.
485What are Streaming requirements of SM API? 1. 'stored as orc' has to be specified during table creation.
4862. Hive table must be bucketed and not sorted. Use something like 'clustered by (cloname) into 10 buckets.
4873. User process must have permissions to write to the table or partition.
4884.Hive transactions must be configured for each table.
489Can you move records between buckets in Hive? No.exception will be thrown.
490This means you cannot UPDATE bucket columns.
491Is deadlock possible in SM API? Yes, it’s possible if other streams are writing into the same table/partition.
492Does Hive have deadlock detector? No. check HIVE-9675
493How do you read data using SM API? 1. Get transaction list from Metastore. (ValidTxnList).
4942. Aquire lock with metastore and issue heartbeats (LockImpl).
4953. Configure the OrcInputFormat and then read the data.
4964. Release the lock.
497How do you control Spark's memory prioritization? use persistencePriority() function
498In which 3 ways RDDs can be created? 1. Transforming existing RDD.
4992. From SparkContext.
5003. Converting DataFrame or Dataset (created from SparkSession).
501Will table rename move HDFS files? As of v 0.6 rename of managed tables moves it's HDFS location.
502As of 2.2.0 files are moved only if there's no LOCATION clause.
503How to change field delimiter in Hive table?
504ALTER TABLE table_name SET SERDEPROPERTIES ('field.delim' = ',');
505Give me example of constraint creation in HIVE; ALTER TABLE table_name ADD CONSTRAINT constraint_name PRIMARY KEY (column, ...) DISABLE NOVALIDATE;
506ALTER TABLE table_name ADD CONSTRAINT constraint_name FOREIGN KEY (column, ...) REFERENCES table_name(column, ...) DISABLE NOVALIDATE RELY;
507What partition operation are allowed in Hive? Add, rename, exchange (move), or (un) archive.
508What are alternative ways of adding partitions in Hive? Copy files to HDFS location and then make metastore aware of it.
509You can use metastore check command (MSCK) or on AWS EMR use RECOVER PARTITIONS.
510How do you "directly" add partitions to Hive? By using "hadoop fs -put " command.
511You have to explicitly run ALTER table ADD PARTITION to make Hive aware of directly added partition.
512Alternatively run : MSCK REPAIR TABLE table_name;
513It will add metadata about partitions (ALL OF THEM) to the Hive metastore.
514Explain table COMPACTION. In Hive release 0.13.0 transactions introduced.
515ALTER TABLE can request COMPACTION.
516By default system detects the need for compation and initioates it.
517You can turn it off. But then use ALTER TABLE … COMPACT manually.
518It will enqueue request and return.
519Use AND WAIT to wait fo compation operation to complete.
520The compation type can be MAJOR and MINOR.
521Explain table CONCATENATE. If table or partition contains many small RCFiles or ORC files then CONCATENATE will merge them in one larger file.
522RCFiles will merge at block level (uncompress then compress).
523ORC files will merge at stripe level (avoiding overhead of decompressing and decoding the data).
524How do you display view definition in Hive? Use SHOW CREATE TABLE <view_name>
525What is CTE? A Common Table Expression is a temporary result set derived from a simple query specified in a WITH clause.
526Are there indexes in Hive? Index support was removed in Hive 3.0.
527What is SerDe? Serializer/Deserializer.
528Hive uses SerDe interface for IO.
529You can write your own SerDe fro your own data format.
530Hive uses SerDe (and FileFormat) to read and write table rows.
531HDFS files --> InputFileFormat --> <key, value> --> Deserializer --> Row object.
532Row object --> Serializer --> <key, value> --> OutputFileFormat --> HDFS files.
533Key is ignored. Row object is stored into a "value".