101 Best Data Science Interview Questions 2018
 1. What is Data Science?
 2. How is it different from Big Data and Data Analytics?
 3. Differentiate between Data Science, Machine Learning and AI.
 4. What is logistic regression? Or state an example of when you have used logistic regression recently.
 5. Compare R and Python programming.
 6. What is Linear Regression?
 7. What is Interpolation and Extrapolation?
 8. What is power analysis?
 9. What is K-means? How can you select K for K-means?
 10. What is Collaborative filtering?
 11. What is the difference between Cluster and Systematic Sampling?
 12. Are expected value and mean value different?
 13. What is Machine Learning?
 14. How would you create a taxonomy to identify key customer trends in unstructured data?
 15. Which technique is used to predict categorical responses?
 16. Why does data cleaning play a vital role in analysis?
 17. What are Recommender Systems?
 18. Differentiate between univariate, bivariate and multivariate analysis.
 24. Explain the various benefits of the R language.
 25. How do Data Scientists use Statistics?
 26. How is machine learning deployed in real-world scenarios?
 27. What are the various aspects of a Machine Learning process?
 28. What is Linear Regression?
 29. How is Data modeling different from Database design?
 30. What does the p-value signify about the statistical data?
 31. What is the difference between Supervised Learning and Unsupervised Learning?
 32. What is the goal of A/B Testing?
 What is an Eigenvalue and Eigenvector?
 How can outlier values be treated?
 33. How can you assess a good logistic model?
 34. Can you write the formula to calculate R-squared?
 35. Compare R and Python programming languages for Predictive Modelling.
 36. Explain data import in the R language.
 37. Two vectors X and Y are defined as follows: X <- c(3, 2, 4) and Y <- c(1, 2). What will be the output of vector Z that is defined as Z <- X*Y?
 38. How are missing values and impossible values represented in the R language?
 39. R language has several packages for solving a particular problem. How do you make a decision on which one is the best to use?
 40. Which function in R language is used to find out whether the means of 2 groups are equal to each other or not?
 41. What is the best way to communicate the results of data analysis using R language?
 42. How many data structures does R language have?
 43. What is the process to create a table in R language without using external files?
 44. Explain about the significance of transpose in R language
 45. What are the with() and by() functions used for?
 46. dplyr package is used to speed up data frame management code. Which package can be integrated with dplyr for large fast tables?
 47. In base graphics system, which function is used to add elements to a plot?
 48. What are the different types of sorting algorithms available in R language?
 49. What is the command used to store R objects in a file?
 50. What is the best way to use Hadoop and R together for analysis?
 51. What will be the output of log(-5.8) when executed on R console?
 52. How is a Data object represented internally in R language?
 53. Which package in R supports the exploratory analysis of genomic data?
 54. What is the difference between a data frame and a matrix in R?
 55. How can you add datasets in R?
 56. How do you split a continuous variable into different groups/ranks in R?
 57. What are factor variables in R language?
 58. What is the memory limit in R?
 59. What are the data types in R on which binary operators can be applied?
 60. How do you create log linear models in R language?
 61. What will be the class of the resulting vector if you concatenate a number and NA?
 62. Write a function in R language to replace the missing value in a vector with the mean of that vector.
 63.What happens if the application object is not able to handle an event?
 64. Differentiate between lapply and sapply.
 65. Differentiate between seq(6) and seq_along(6).
 66. How will you read a .csv file in R language?
 67. How do you write R commands?
 68. How can you verify if a given object “X” is a matrix data object?
 69. What do you understand by element recycling in R?
 70. How can you verify if a given object “X” is a matrix data object?
 71. How will you measure the probability of a binary response variable in R language?
 72. What is the use of sample and subset functions in R programming language?
 73. What are various steps involved in an analytics project?
 74. How can you iterate over a list and also retrieve element indices at the same time?
 75. During analysis, how do you treat missing values?
 76. What areas of machine learning are you most familiar with?
 77. What sort of optimization problem would you be solving to train a support vector machine?
 78. Tell me about positives and negatives of using Gaussian processes / general kernel methods approach to learning.
 79. How does a kernel method scale with the number of instances (e.g. with a Gaussian rbf kernel)?
 80. Describe ways to overcome scaling issues.
 81. What are some tools for parallelizing machine learning algorithms?
 82. In Python, do you have a favorite/least favorite PEP?
 Tableau Interview Questions
 83. What is Data Visualization?
 84. What are the differences between Tableau desktop and Tableau Server?
 85. Define parameters in Tableau and their working.
 86. Differentiate between parameters and filters in Tableau.
 87. What are fact table and Dimension table in Tableau?
 88. What are Quick Filters in Tableau?
 89. State limitations of parameters in Tableau.
 90. What is aggregation and disaggregation of data in Tableau?
 91. What is Data Blending?
 92. What is a Context Filter?
 93. What are the limitations of context filters?
 94. Name the file extensions in Tableau.
 95. Explain the difference between .twb and .twbx
 96. What are Extracts and Schedules in Tableau server?
 97. Name the components of a Dashboard
 98. How to view underlying SQL Queries in Tableau?
 99. What is Page shelf?
 100. How to do Performance Testing in Tableau?
 101. How many maximum tables can you join in Tableau?
1. What is Data Science?
Data Science is a blend of statistics, technical skills and business vision that is used to analyze the available data and predict future trends.
2. How is it different from Big Data and Data Analytics?
Big Data: Huge volumes of data (structured, unstructured and semi-structured). Requires a basic knowledge of statistics and mathematics.
Data Science: Deals with slicing and dicing the data. Requires in-depth knowledge of statistics and mathematics.
Data Analytics: Contributes operational insights into complex business scenarios. Requires a moderate amount of statistics and mathematics.
3. Differentiate between Data Science, Machine Learning and AI.
Data Science: Not exactly a subset of machine learning, but it uses machine learning to analyse data and make future predictions.
Machine Learning: A subset of AI that focuses on a narrow range of activities.
AI: A wide term that covers applications ranging from robotics to text analysis.
4. What is logistic regression? Or state an example of when you have used logistic regression recently.
Logistic regression, often referred to as the logit model, is a technique for predicting a binary outcome from a linear combination of predictor variables. For example, suppose you want to predict whether a particular political leader will win an election or not. In this case, the outcome of the prediction is binary, i.e. 1 or 0 (win/lose). The predictor variables here would be the amount of money spent on the candidate's election campaign, the amount of time spent campaigning, etc.
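The election example can be sketched in Python as follows. This is only an illustration: the function name, the two predictors and all coefficient values are made-up assumptions rather than a fitted model.

```python
import math

def predict_win_probability(spend_millions, campaign_days,
                            intercept=-4.0, w_spend=0.5, w_days=0.02):
    """Return P(win) from a linear combination of the predictors.

    The coefficients are hypothetical; a real model would estimate them from data.
    """
    # Linear combination of the predictor variables (the log-odds, or logit).
    log_odds = intercept + w_spend * spend_millions + w_days * campaign_days
    # The logistic (sigmoid) function maps log-odds to a probability in (0, 1).
    return 1.0 / (1.0 + math.exp(-log_odds))

p = predict_win_probability(spend_millions=6.0, campaign_days=90)
predicted_class = 1 if p >= 0.5 else 0   # 1 = win, 0 = lose
print(round(p, 3), predicted_class)
```

The 0.5 cut-off used to turn the probability into a win/lose label is itself a choice; other thresholds may suit other problems.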
5. Compare R and Python programming.
R: The best part about R is that it is an open-source tool and hence widely used by academia and the research community. It is a robust tool for statistical computation, graphical representation and reporting. Because it is open source, it is updated frequently and new features are readily available to everybody.
Python: Python is a powerful open-source programming language that is easy to learn and works well with most other tools and technologies. The best part about Python is that it has a very large number of libraries and community-created modules, which makes it very robust. It has functions for statistical operations, model building and more.
R and Python are two of the most important programming languages for machine learning.
6. What is Linear Regression?
Linear regression is a statistical technique where the score of a variable Y is predicted from the score of a second variable X. X is referred to as the predictor variable and Y as the criterion variable.
7. What is Interpolation and Extrapolation?
Interpolation is estimating a value that lies between two known values in a list of values. Extrapolation is approximating a value by extending a known set of values or facts beyond their range.
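A minimal sketch of the difference, assuming a straight-line relationship between two known points (the helper name and the sample points are illustrative):

```python
def linear_estimate(x0, y0, x1, y1, x):
    """Estimate y at x from the straight line through (x0, y0) and (x1, y1)."""
    slope = (y1 - y0) / (x1 - x0)
    return y0 + slope * (x - x0)

# Known points: (2, 20) and (4, 40).
inside = linear_estimate(2, 20, 4, 40, 3)    # x = 3 lies between the known points: interpolation
outside = linear_estimate(2, 20, 4, 40, 6)   # x = 6 lies beyond the known points: extrapolation
print(inside, outside)
```

The same formula serves both cases; extrapolation is usually the riskier of the two because nothing in the data confirms that the trend continues outside the known range.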
8. What is power analysis?
An experimental design technique for determining the sample size required to detect an effect of a given size with a given degree of confidence.
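9. What is K-means? How can you select K for K-means?
K-means clustering aims to partition n observations into k clusters in which each observation belongs to the cluster with the nearest mean, which serves as a prototype of the cluster. This results in a partitioning of the data space into Voronoi cells. A common way to select K is the elbow method: run K-means for several values of K and choose the value after which the within-cluster sum of squares stops dropping sharply.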
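The following sketch runs a small K-means on one-dimensional data and prints the within-cluster sum of squares for several values of K, which is one common way (the elbow method) to pick K. The data, the deterministic starting centroids and the function name are assumptions made for illustration; real projects would normally use a library implementation.

```python
def kmeans_1d(points, k, iterations=20):
    """Lloyd's algorithm on 1-D data; returns (centroids, within-cluster sum of squares)."""
    data = sorted(points)
    # Deterministic start: k values spread evenly across the sorted data.
    centroids = [data[(2 * i + 1) * len(data) // (2 * k)] for i in range(k)]
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for p in data:
            # Assign each observation to the cluster with the nearest mean.
            nearest = min(range(k), key=lambda j: (p - centroids[j]) ** 2)
            clusters[nearest].append(p)
        # Move each centroid to the mean of its cluster (keep it if the cluster is empty).
        centroids = [sum(c) / len(c) if c else centroids[j]
                     for j, c in enumerate(clusters)]
    wcss = sum(min((p - c) ** 2 for c in centroids) for p in data)
    return centroids, wcss

points = [1.0, 1.2, 0.8, 5.0, 5.3, 4.7, 9.0, 9.4, 8.6]   # three visible groups
for k in range(1, 6):
    _, wcss = kmeans_1d(points, k)
    print(k, round(wcss, 2))
```

For this data the printed value drops steeply up to K = 3 and only slightly afterwards, so K = 3 is the "elbow".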
10. What is Collaborative filtering?
The process of filtering used by most recommender systems to find patterns or information by combining the viewpoints of many users, various data sources and multiple agents.
11. What is the difference between Cluster and Systematic Sampling?
Cluster sampling is a technique used when it becomes difficult to study a target population spread across a wide area and simple random sampling cannot be applied. A cluster sample is a probability sample where each sampling unit is a collection, or cluster, of elements. Systematic sampling is a statistical technique where elements are selected from an ordered sampling frame at a regular interval. In systematic sampling, the list is traversed in a circular manner, so once you reach the end of the list, you continue from the top again. The best example of systematic sampling is the equal-probability method.
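The two schemes can be contrasted with a short sketch. The function names, the sampling frame and the cluster labels are hypothetical; the systematic sampler wraps around the end of the list as described above.

```python
import random

def systematic_sample(frame, n, rng):
    """Pick every k-th element of an ordered frame, wrapping around at the end."""
    k = len(frame) // n                     # sampling interval
    start = rng.randrange(len(frame))       # random starting position
    return [frame[(start + i * k) % len(frame)] for i in range(n)]

def cluster_sample(clusters, n_clusters, rng):
    """Randomly pick whole clusters and keep every element inside them."""
    chosen = rng.sample(sorted(clusters), n_clusters)
    return [unit for name in chosen for unit in clusters[name]]

rng = random.Random(0)                      # fixed seed so the run is repeatable
frame = list(range(1, 21))                  # ordered sampling frame: units 1..20
clusters = {"north": [1, 2, 3], "south": [4, 5], "east": [6, 7, 8, 9], "west": [10]}

print(systematic_sample(frame, 5, rng))
print(cluster_sample(clusters, 2, rng))
```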
12. Are expected value and mean value different?
They are not different, but the terms are used in different contexts. Mean is generally used when talking about a probability distribution or a sample population, whereas expected value is generally used in the context of a random variable.
For Sampling Data
The mean value is the only value that comes from the sample data.
The expected value is the mean of all the means, i.e. the value that is built from multiple samples. The expected value is the population mean.
For Distributions
The mean value and the expected value are the same irrespective of the distribution, provided that the distribution belongs to the same population.
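As a rough illustration, assuming a fair six-sided die as the random variable, the sketch below compares the expected value with the mean of one small sample and with the mean of many sample means. The seed, sample size and number of repetitions are arbitrary choices.

```python
import random
from fractions import Fraction

# Expected value of a fair six-sided die: sum of value * probability.
expected_value = sum(face * Fraction(1, 6) for face in range(1, 7))   # 7/2

rng = random.Random(42)   # fixed seed so the run is repeatable

def sample_mean(n):
    """Mean of n simulated die rolls."""
    rolls = [rng.randint(1, 6) for _ in range(n)]
    return sum(rolls) / n

one_sample = sample_mean(10)                                        # mean of a single sample
mean_of_means = sum(sample_mean(10) for _ in range(5000)) / 5000    # mean of many sample means
print(float(expected_value), one_sample, round(mean_of_means, 3))
```

A single sample mean can sit noticeably away from 3.5, while the mean of many sample means settles close to the expected value.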
13. What is Machine Learning?
Machine learning is an application of artificial intelligence (AI) that gives systems the ability to automatically learn and improve from experience without being explicitly programmed. Machine learning focuses on the development of computer programs that can access data and use it to learn for themselves.
The process of learning begins with observations or data, such as examples, direct experience, or instruction, in order to look for patterns in data and make better decisions in the future based on the examples that we provide. The primary aim is to allow computers to learn automatically without human intervention or assistance and adjust their actions accordingly.
14. How would you create a taxonomy to identify key customer trends in unstructured data?
The best way to approach this question is to mention that it is good to check with the business owner and understand their objectives before categorizing the data. Having done this, it is always good to follow an iterative approach: pull new data samples, improve the model accordingly, and validate it for accuracy by soliciting feedback from the stakeholders of the business. This helps ensure that your model is producing actionable results and improving over time.
15. Which technique is used to predict categorical responses?
Classification techniques are widely used in data mining to predict categorical responses.
16. Why does data cleaning play a vital role in analysis?
Cleaning data from multiple sources to transform it into a format that data analysts or data scientists can work with is a cumbersome process, because as the number of data sources increases, the time taken to clean the data increases exponentially due to the number of sources and the volume of data they generate. Cleaning alone might take up to 80% of the time, which makes it a critical part of the analysis task.
17. What are Recommender Systems?
A subclass of information filtering systems that are meant to predict the preferences or ratings that a user would give to a product. Recommender systems are widely used in movies, news, research articles, products, social tags, music, etc.
18. Differentiate between univariate, bivariate and multivariate analysis.
These are descriptive statistical analysis techniques that can be differentiated based on the number of variables involved at a given point in time. For example, a pie chart of sales by territory involves only one variable and can be referred to as univariate analysis.
If the analysis attempts to understand the relationship between 2 variables at a time, as in a scatterplot, then it is referred to as bivariate analysis. For example, analysing the volume of sales against spending can be considered an example of bivariate analysis.
Analysis that deals with the study of more than two variables to understand the effect of the variables on the responses is referred to as multivariate analysis.
24. Explain the various benefits of the R language.
The R programming language includes a software suite that is used for graphical representation, statistical computing, data manipulation and calculation. Its benefits include:
 An extensive collection of tools for data analysis
 Operators for performing calculations on matrices and arrays
 Data analysis techniques for graphical representation
 A highly developed yet simple and effective programming language
 Extensive support for machine learning applications
 It acts as a connecting link between various software, tools and datasets
 It creates high-quality, reproducible analysis that is flexible and powerful
 It provides a robust package ecosystem for diverse needs
 It is useful when you have to solve a data-oriented problem
25. How do Data Scientists use Statistics?
Statistics helps Data Scientists look into the data for patterns and hidden insights and convert Big Data into big insights. It helps them get a better idea of what customers are expecting. Data Scientists can learn about consumer behavior, interest, engagement, retention and finally conversion, all through the power of insightful statistics. It helps them build powerful data models in order to validate certain inferences and predictions. All this can be converted into a powerful business proposition by giving users what they want precisely when they want it.
26. How is machine learning deployed in real-world scenarios?
Here are some of the scenarios in which machine learning finds applications in the real world:
 E-commerce: Understanding customer churn, deploying targeted advertising, remarketing
 Search engines: Ranking pages depending on the personal preferences of the searcher
 Finance: Evaluating investment opportunities and risks, detecting fraudulent transactions
 Healthcare: Designing drugs depending on the patient’s history and needs
 Robotics: Handling situations that are out of the ordinary
 Social media: Understanding relationships and recommending connections
 Information extraction: Framing questions for getting answers from databases over the web
27. What are the various aspects of a Machine Learning process?
Domain knowledge
 This is the first step wherein we need to understand how to extract the various features from the data and learn more about the data that we are dealing with. It has got more to do with the type of domain that we are dealing with and familiarizing the system to learn more about it.
Feature Selection
 This step has got more to do with the feature that we are selecting from the set of features that we have. Sometimes it happens that there are a lot of features and we have to make an intelligent decision regarding the type of feature that we want to select to go ahead with our machine learning endeavor.
Algorithm
 This is a vital step, since the algorithms that we choose will have a major impact on the entire machine learning process. You can choose between linear and nonlinear algorithms. Some of the algorithms used are Support Vector Machines, Decision Trees, Naïve Bayes, K-Means Clustering, etc.
Training
 This is the most important part of the machine learning technique, and this is where it differs from traditional programming. The training is done based on the data that we have, and on providing more real-world experience. With each subsequent training step the machine gets better and smarter and is able to make improved decisions.
Evaluation
 In this step we evaluate the decisions taken by the machine in order to decide whether they are up to the mark or not. There are various metrics involved in this process, and we have to apply each of them carefully to decide on the efficacy of the whole machine learning endeavor.
Optimization
 This process involves improving the performance of the machine learning process using various optimization techniques. Optimization of machine learning is one of the most vital components wherein the performance of the algorithm is vastly improved. The best part of optimization techniques is that machine learning is not just a consumer of optimization techniques but it also provides new ideas for optimization too.
Testing
 Here various tests are carried out, some of them on an unseen set of test cases. The data is partitioned into a test set and a training set. There are various testing techniques, such as cross-validation, for dealing with multiple situations.
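The partitioning mentioned in the Testing step can be sketched as below: a held-out test set plus k-fold cross-validation on the remaining rows. The helper names, the 20% test fraction and the choice of 4 folds are assumptions for illustration only.

```python
import random

def train_test_split(rows, test_fraction, rng):
    """Shuffle the rows and hold out a fraction of them as an unseen test set."""
    shuffled = rows[:]
    rng.shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]

def k_fold(rows, k):
    """Yield (training, validation) pairs; each row is validated exactly once."""
    for i in range(k):
        validation = rows[i::k]
        training = [r for j, r in enumerate(rows) if j % k != i]
        yield training, validation

rows = list(range(10))                     # stand-in for 10 labelled observations
train, test = train_test_split(rows, 0.2, random.Random(0))
for training, validation in k_fold(train, 4):
    print(len(training), validation)
```

The test rows are never touched during cross-validation, so they remain a fair check of the final model.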
28. What is Linear Regression?
It is the most commonly used method for predictive analytics. Linear regression is used to describe the relationship between a dependent variable and one or more independent variables. The main task in linear regression is fitting a single line within a scatter plot. Linear regression consists of the following three steps:
 Determining and analyzing the correlation and direction of the data
 Estimating the model
 Ensuring the usefulness and validity of the model
It is extensively used in scenarios where a cause-and-effect model comes into play. For example, you may want to know the effect of a certain action in order to determine the various outcomes and the extent to which the cause determines the final outcome.
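Fitting a single line to a scatter of points can be sketched with ordinary least squares for one predictor. The variable names and the data are hypothetical and chosen so that the true line is easy to see.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    # Slope = covariance of x and y divided by the variance of x.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope

ad_spend = [1, 2, 3, 4, 5]          # hypothetical predictor values
sales = [3, 5, 7, 9, 11]            # hypothetical responses (exactly 1 + 2 * spend)
intercept, slope = fit_line(ad_spend, sales)
print(intercept, slope)
```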
29. How is Data modeling different from Database design?
Data Modeling: It can be considered as the first step towards the design of a database. Data modeling creates a conceptual model based on the relationship between various data models. The process involves moving from the conceptual stage to the logical model to the physical schema. It involves the systematic method of applying the data modeling techniques.
Database Design: This is the process of designing the database. The database design creates an output which is a detailed data model of the database. Strictly speaking database design includes the detailed logical model of a database but it can also include physical design choices and storage parameters.
30. What does the p-value signify about the statistical data?
The p-value is used to determine the significance of results after a hypothesis test in statistics. It helps readers draw conclusions and always lies between 0 and 1.
 A p-value > 0.05 denotes weak evidence against the null hypothesis, which means the null hypothesis cannot be rejected.
 A p-value < 0.05 denotes strong evidence against the null hypothesis, which means the null hypothesis can be rejected.
 A p-value = 0.05 is the marginal value, indicating it is possible to go either way.
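As an illustration of where such a number comes from, the sketch below computes the p-value of a two-sided z-test. The choice of test, the assumption of a known population standard deviation and all the figures are illustrative.

```python
import math

def two_sided_p_value(sample_mean, null_mean, sigma, n):
    """p-value of a two-sided z-test with known population standard deviation."""
    z = (sample_mean - null_mean) / (sigma / math.sqrt(n))
    # P(|Z| >= |z|) for a standard normal Z.
    return math.erfc(abs(z) / math.sqrt(2))

p = two_sided_p_value(sample_mean=103.0, null_mean=100.0, sigma=15.0, n=100)
decision = "reject the null hypothesis" if p < 0.05 else "fail to reject the null hypothesis"
print(round(p, 4), decision)
```

Here the sample mean is two standard errors away from the hypothesised mean, which gives a p-value a little below 0.05.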
31. What is the difference between Supervised Learning and Unsupervised Learning?
If an algorithm learns something from the training data so that the knowledge can be applied to the test data, then it is referred to as supervised learning. Classification is an example of supervised learning. If the algorithm does not learn anything beforehand because there is no response variable or labelled training data, then it is referred to as unsupervised learning. Clustering is an example of unsupervised learning.
32. What is the goal of A/B Testing?
It is statistical hypothesis testing for a randomized experiment with two variants, A and B. The goal of A/B testing is to identify changes to a web page that maximize or increase an outcome of interest. An example of this could be comparing the click-through rate of two versions of a banner ad.
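One way to run the hypothesis test behind an A/B comparison is a two-proportion z-test on the click-through rates. The sketch below assumes that test and uses invented click and view counts.

```python
import math

def ab_test_p_value(clicks_a, views_a, clicks_b, views_b):
    """Two-sided two-proportion z-test on the click-through rates of A and B."""
    rate_a, rate_b = clicks_a / views_a, clicks_b / views_b
    # Pooled rate under the null hypothesis that A and B perform the same.
    pooled = (clicks_a + clicks_b) / (views_a + views_b)
    std_error = math.sqrt(pooled * (1 - pooled) * (1 / views_a + 1 / views_b))
    z = (rate_b - rate_a) / std_error
    return math.erfc(abs(z) / math.sqrt(2))

# Hypothetical experiment: banner A gets 100 clicks, banner B gets 140, from 2000 views each.
p_value = ab_test_p_value(100, 2000, 140, 2000)
print(round(p_value, 4), "significant" if p_value < 0.05 else "not significant")
```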
What is an Eigenvalue and Eigenvector?
Eigenvectors are used for understanding linear transformations. In data analysis, we usually calculate the eigenvectors for a correlation or covariance matrix. Eigenvectors are the directions along which a particular linear transformation acts by flipping, compressing or stretching. Eigenvalue can be referred to as the strength of the transformation in the direction of eigenvector or the factor by which the compression occurs.
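A small sketch of the idea, using power iteration to find the dominant eigenvector and eigenvalue of a made-up 2 x 2 covariance matrix. Power iteration is just one simple method and assumes the matrix has a single largest eigenvalue; numerical libraries use more general routines.

```python
import math

def dominant_eigenpair(matrix, iterations=100):
    """Power iteration: returns the largest eigenvalue and its unit eigenvector."""
    n = len(matrix)
    vector = [1.0] + [0.0] * (n - 1)          # arbitrary non-zero starting direction
    for _ in range(iterations):
        product = [sum(matrix[i][j] * vector[j] for j in range(n)) for i in range(n)]
        norm = math.sqrt(sum(v * v for v in product))
        vector = [v / norm for v in product]
    # Rayleigh quotient v . (A v): the stretch factor along the eigenvector.
    transformed = [sum(matrix[i][j] * vector[j] for j in range(n)) for i in range(n)]
    eigenvalue = sum(v * t for v, t in zip(vector, transformed))
    return eigenvalue, vector

covariance = [[2.0, 1.0], [1.0, 2.0]]         # hypothetical 2 x 2 covariance matrix
eigenvalue, eigenvector = dominant_eigenpair(covariance)
print(round(eigenvalue, 6), [round(v, 6) for v in eigenvector])
```

Multiplying the matrix by the returned vector only rescales it by the eigenvalue, which is exactly the "direction" and "strength" described above.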
How can outlier values be treated?
Outlier values can be identified by using univariate analysis or any other graphical analysis method. If there are only a few outlier values, they can be assessed individually, but for a large number of outliers the values can be substituted with either the 99th or the 1st percentile values. Not all extreme values are outliers. The most common ways to treat outlier values are:
1) To change the value and bring it within a range
2) To just remove the value.
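Substituting extreme values with the 1st and 99th percentile values can be sketched as below. The nearest-rank percentile definition and the sample data are assumptions; statistical packages offer several other percentile definitions.

```python
def percentile(sorted_values, q):
    """Nearest-rank percentile (q in 0..100) of an already sorted list."""
    rank = max(1, -(-q * len(sorted_values) // 100))   # ceiling of q * n / 100
    return sorted_values[rank - 1]

def cap_outliers(values, lower_q=1, upper_q=99):
    """Replace values outside the chosen percentiles with the percentile values."""
    ordered = sorted(values)
    low, high = percentile(ordered, lower_q), percentile(ordered, upper_q)
    return [min(max(v, low), high) for v in values]

values = list(range(1, 100)) + [10_000]     # 99 ordinary values and one extreme value
capped = cap_outliers(values)
print(max(values), max(capped))
```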
33. How can you assess a good logistic model?
There are various methods to assess the results of a logistic regression analysis:
 Using a classification matrix to look at the true negatives and false positives.
 Concordance, which helps identify the ability of the logistic model to differentiate between the event happening and not happening.
 Lift, which helps assess the logistic model by comparing it with random selection.
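The first of these methods, the classification matrix, can be sketched as a simple count of prediction outcomes. The labels below are invented, and accuracy is shown only as one of several summaries that can be read off the matrix.

```python
def classification_matrix(actual, predicted):
    """Count true/false positives and negatives for 0/1 labels."""
    counts = {"tp": 0, "tn": 0, "fp": 0, "fn": 0}
    for a, p in zip(actual, predicted):
        if p == 1:
            counts["tp" if a == 1 else "fp"] += 1
        else:
            counts["tn" if a == 0 else "fn"] += 1
    return counts

actual =    [1, 0, 1, 1, 0, 0, 1, 0]      # hypothetical observed outcomes
predicted = [1, 0, 0, 1, 0, 1, 1, 0]      # hypothetical model predictions
matrix = classification_matrix(actual, predicted)
accuracy = (matrix["tp"] + matrix["tn"]) / len(actual)
print(matrix, accuracy)
```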
R Interview Questions
34. Can you write the formula to calculate R-squared?
R-squared can be calculated using the formula below:
R-squared = 1 - (Residual Sum of Squares / Total Sum of Squares)
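The formula translates directly into code. In this sketch the observed and predicted values are invented for illustration.

```python
def r_squared(actual, predicted):
    """R-squared = 1 - (residual sum of squares / total sum of squares)."""
    mean_actual = sum(actual) / len(actual)
    rss = sum((a - p) ** 2 for a, p in zip(actual, predicted))   # residual sum of squares
    tss = sum((a - mean_actual) ** 2 for a in actual)            # total sum of squares
    return 1 - rss / tss

observed = [3.0, 5.0, 7.0, 9.0]
fitted = [2.5, 5.5, 7.0, 9.0]      # hypothetical model output
print(r_squared(observed, fitted))
```

A model that reproduces every observed value exactly has a residual sum of squares of zero and therefore an R-squared of 1.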
35. Compare R and Python programming languages for Predictive Modelling.
Model Building: Python and R are similar.
Model Interpretability: R is better than Python.
Production: Python is better than R.
Community Support: R has better community support than Python.
Data Science Libraries: Python and R are similar.
Data Visualizations: R has better data visualization libraries and tools than Python.
Learning Curve: Learning Python is easier than learning R, which has a steep learning curve.
36. Explain about data import in R language
R Commander is one way to import data in R. To start the R Commander GUI, the user must load the package by typing library(Rcmdr) into the console. There are 3 different ways in which data can be imported:
 Users can select the data set in the dialog box or enter the name of the data set (if they know it).
 Data can also be entered directly using the editor of R Commander via Data > New Data Set. However, this works well only when the data set is not too large.
 Data can also be imported from a URL, from a plain text file (ASCII), from another statistical package or from the clipboard.
37. Two vectors X and Y are defined as follows: X <- c(3, 2, 4) and Y <- c(1, 2). What will be the output of vector Z that is defined as Z <- X*Y?
In R, when two vectors have different lengths, the elements of the shorter vector are recycled (reused from the start) until every element of the longer vector has been multiplied. Because the longer length (3) is not a multiple of the shorter length (2), R also issues a warning.
The output of the above code will be:
Z: 3 4 4
38. How are missing values and impossible values represented in the R language?
NaN (Not a Number) is used to represent impossible values, whereas NA (Not Available) is used to represent missing values. The best way to answer this question would be to mention that deleting missing values is not a good idea, because the probable cause of a missing value could be some problem with data collection, programming or the query. It is good to find the root cause of the missing values and then take the necessary steps to handle them.
39. R language has several packages for solving a particular problem. How do you make a decision on which one is the best to use?
The CRAN package ecosystem has more than 6000 packages. The best way for beginners to answer this question is to mention that they would look for a package that follows good software development principles. The next thing would be to look for user reviews and find out if other data scientists or analysts have been able to solve a similar problem.
40. Which function in R language is used to find out whether the means of 2 groups are equal to each other or not?
t.test()
41. What is the best way to communicate the results of data analysis using R language?
The best possible way to do this is to combine the data, code and analysis results in a single document using knitr for reproducible research. This helps others to verify the findings, add to them and engage in discussions. Reproducible research makes it easy to redo the experiments by inserting new data and to apply the analysis to a different problem.
42. How many data structures does R language have?
R language has homogeneous and heterogeneous data structures. Homogeneous data structures hold objects of the same type: vector, matrix and array. Heterogeneous data structures hold objects of different types: data frames and lists.
43. What is the process to create a table in R language without using external files?
MyTable <- data.frame()
MyTable <- edit(MyTable)
The above code opens a spreadsheet-like data editor for entering data into MyTable.
44. Explain the significance of transpose in R language.
The transpose function t() is the easiest method for reshaping the data before analysis.
45. What are the with() and by() functions used for?
The with() function is used to apply an expression to a given dataset, and the by() function is used to apply a function to each level of a factor.
46. dplyr package is used to speed up data frame management code. Which package can be integrated with dplyr for large fast tables?
data.table
47. In base graphics system, which function is used to add elements to a plot?
boxplot() or text()
48. What are the different types of sorting algorithms available in R language?
Bucket Sort
Selection Sort
Quick Sort
Bubble Sort
Merge Sort
49. What is the command used to store R objects in a file?
save(x, file = "x.Rdata")
50. What is the best way to use Hadoop and R together for analysis?
HDFS can be used for storing the data long-term. MapReduce jobs submitted from Oozie, Pig or Hive can be used to encode, improve and sample the data sets from HDFS into R. This makes it possible to run complex analysis tasks on the subset of data prepared in R.
51. What will be the output of log(-5.8) when executed on R console?
Executing the above on the R console displays a warning that NaNs were produced, because it is not possible to take the log of a negative number. The result is NaN (Not a Number).
52. How is a Date object represented internally in R language?
A Date is stored internally as the number of days since 1970-01-01, which can be seen with:
unclass(as.Date("2016-10-05"))
53. Which package in R supports the exploratory analysis of genomic data?
adegenet
54. What is the difference between a data frame and a matrix in R?
A data frame can contain heterogeneous inputs while a matrix cannot. In a matrix only a single data type can be stored, whereas the columns of a data frame can hold different data types such as characters, integers or even other data frames.
55. How can you add datasets in R?
The rbind() function can be used to add datasets in R language, provided the columns in the datasets are the same.
56. How do you split a continuous variable into different groups/ranks in R?
57. What are factor variable in R language?
Factor variables are categorical variables that hold either string or numeric values. Factor variables are used in various types of graphics and particularly for statistical modelling where the correct number of degrees of freedom is assigned to them.
58. What is the memory limit in R?
8 TB is the memory limit on a 64-bit system and 3 GB is the limit on a 32-bit system.
59. What are the data types in R on which binary operators can be applied?
Scalars, matrices and vectors.
60. How do you create log linear models in R language?
Using the loglm() function from the MASS package.
61. What will be the class of the resulting vector if you concatenate a number and NA?
numeric
62. Write a function in R language to replace the missing value in a vector with the mean of that vector.
mean_impute <- function(x) { x[is.na(x)] <- mean(x, na.rm = TRUE); x }
63.What happens if the application object is not able to handle an event?
The event is dispatched to the delegate for processing.
64. Differentiate between lapply and sapply.
If the programmers want the output to be a data frame or a vector, then sapply function is used whereas if a programmer wants the output to be a list then lapply is used. There one more function known as vapply which is preferred over sapply as vapply allows the programmer to specific the output type. The disadvantage of using vapply is that it is difficult to be implemented and more verbose.
65. Differentiate between seq(6) and seq_along(6)
seq(6) produces the sequential vector from 1 to 6, i.e. c(1, 2, 3, 4, 5, 6), whereas seq_along(6) produces a sequence as long as its argument. Since 6 is a vector of length 1, seq_along(6) returns just 1.
66. How will you read a .csv file in R language?
The read.csv() function is used to read a .csv file in R language. Below is a simple example:
filecontent <- read.csv("sample.csv")
print(filecontent)
67. How do you write comments in R?
A comment line in R language begins with a hash symbol (#); everything after the hash on that line is ignored by the interpreter.
68. How can you verify if a given object “X” is a matrix data object?
If the function call is.matrix(X) returns TRUE, then X can be termed a matrix data object.
69. What do you understand by element recycling in R?
If two vectors with different lengths take part in an operation, the elements of the shorter vector are reused to complete the operation. This is referred to as element recycling.
Example – if A <- c(1,2,0,4) and B <- c(3,6), then the result of A*B will be (3,12,0,24). Here 3 and 6 of vector B are repeated when computing the result.
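Since all new examples here are in Python, the same recycling behaviour can be mimicked outside R. This is only a sketch: Python lists do not recycle on their own, and recycle_multiply is a hypothetical helper name introduced for illustration.

```python
from itertools import cycle, islice

def recycle_multiply(a, b):
    """Multiply element-wise, repeating the shorter sequence the way R recycles."""
    n = max(len(a), len(b))
    # cycle() repeats a sequence endlessly; islice() cuts it to the longer length
    return [x * y for x, y in zip(islice(cycle(a), n), islice(cycle(b), n))]

A = [1, 2, 0, 4]
B = [3, 6]
result = recycle_multiply(A, B)
print(result)
```

The values of A and B are the ones from the example above, so the printed list can be compared with the R result.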
70. How can you verify if a given object “X” is a matrix data object?
If the function call is.matrix(X) returns TRUE, then X can be considered a matrix data object, otherwise not.
71. How will you measure the probability of a binary response variable in R language?
Logistic regression can be used for this; the glm() function in R language provides this functionality when called with family = binomial.
72. What is the use of sample and subset functions in R programming language?
The sample() function can be used to select a random sample of size ‘n’ from a huge dataset.
The subset() function is used to select variables and observations from a given dataset.
73. What are the various steps involved in an analytics project?
• Understand the business problem.
• Explore the data and become familiar with it.
• Prepare the data for modelling by detecting outliers, treating missing values, transforming variables, etc.
• After data preparation, start running the model, analyse the result and tweak the approach. This is an iterative step that continues until the best possible outcome is achieved.
• Validate the model using a new data set.
• Start implementing the model and track the results to analyse its performance over time.
74. How can you iterate over a list and also retrieve element indices at the same time?
In Python, this can be done using the enumerate function, which walks through a sequence such as a list and pairs every element with its index, placing the index just before the element.
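A minimal sketch of enumerate in Python; the list fruits is hypothetical sample data.

```python
# Hypothetical sample data for illustration
fruits = ["apple", "banana", "cherry"]

# enumerate yields (index, element) pairs; indices start at 0 by default
for index, fruit in enumerate(fruits):
    print(index, fruit)

# The optional start argument shifts the first index
numbered = list(enumerate(fruits, start=1))
```

Unpacking the pair directly in the for statement avoids keeping a separate counter variable.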
75. During analysis, how do you treat missing values?
First identify the variables that have missing values, then determine the extent of the missing values in each. If any patterns are identified, the analyst has to concentrate on them, as they could lead to interesting and meaningful business insights. If there are no patterns, the missing values can be substituted with mean or median values (imputation) or simply be ignored. There are various factors to consider when answering this question:
• Understand the problem statement and the data before giving the answer. A default value, which can be the mean, minimum or maximum value, may be assigned, but getting into the data first is important.
• If it is a categorical variable, the missing value is assigned a default value.
• If the data follows a normal distribution, assign the mean value.
• Whether missing values should be treated at all is another important point to consider. If 80% of the values for a variable are missing, you can answer that you would drop the variable instead of treating the missing values.
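The checks described above can be sketched in a few lines of Python. This is an illustration under stated assumptions: missing entries are represented as None, and missing_share, impute and the ages column are hypothetical names, not part of any library.

```python
from statistics import mean, median

def missing_share(values):
    """Fraction of entries that are missing (represented here as None)."""
    return sum(v is None for v in values) / len(values)

def impute(values, strategy="mean"):
    """Replace None entries with the mean or median of the observed values."""
    observed = [v for v in values if v is not None]
    fill = mean(observed) if strategy == "mean" else median(observed)
    return [fill if v is None else v for v in values]

# Hypothetical column with two missing entries out of five
ages = [20, None, 30, 40, None]
print(missing_share(ages))
print(impute(ages, "mean"))
```

Checking missing_share first supports the last point in the list: when most of a column is missing, dropping it may be more sensible than imputing.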
Machine Learning Interview Questions
76. What areas of machine learning are you most familiar with?
• supervised learning
• unsupervised learning
• anomaly detection
• active learning
• bandits
• Gaussian processes
• kernel methods
• deep networks
77. What sort of optimization problem would you be solving to train a support vector machine?
Maximizing the margin (best answer). Other acceptable answers: a quadratic program, a quadratic objective with linear constraints, or a reference to solving the primal or dual form.
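For reference, the hard-margin primal problem behind the "maximize margin" answer is usually written as follows. This is the standard textbook form rather than something stated in the answer itself; the symbols w, b, x_i, y_i and n (weight vector, bias, training inputs, labels in {-1, +1}, number of instances) are notation introduced here.

```latex
\min_{w,\,b} \ \tfrac{1}{2}\lVert w \rVert^{2}
\quad \text{subject to} \quad
y_i \,(w^{\top} x_i + b) \ge 1, \qquad i = 1, \dots, n
```

The margin width is 2 / ||w||, so minimizing ||w||^2 maximizes the margin; the objective is quadratic and the constraints are linear, which is why it is a quadratic program.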
78. Tell me about positives and negatives of using Gaussian processes / general kernel methods approach to learning.
Positives – nonlinear, nonparametric. Negatives – bad scaling with instances, need to do hyperparameter tuning
79. How does a kernel method scale with the number of instances (e.g. with a Gaussian rbf kernel)?
Quadratic (referring to construction of the gram (kernel) matrix), cubic (referring to the matrix inversion)
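The quadratic part can be seen by building the Gram matrix directly. Below is a minimal pure-Python sketch, assuming a Gaussian RBF kernel of the form exp(-gamma * squared distance); rbf_kernel, gram_matrix, gamma and the toy points are hypothetical names and data.

```python
import math

def rbf_kernel(x, y, gamma=1.0):
    """Gaussian RBF kernel: exp(-gamma * squared Euclidean distance)."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * sq_dist)

def gram_matrix(points, gamma=1.0):
    """Build the full n x n kernel matrix, which takes n * n kernel evaluations."""
    return [[rbf_kernel(p, q, gamma) for q in points] for p in points]

# Hypothetical toy data: 4 two-dimensional points
points = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
K = gram_matrix(points)
entries = sum(len(row) for row in K)  # n * n entries for n points
print(entries)
```

Doubling the number of points quadruples the number of entries, and inverting or factorizing the resulting matrix is where the cubic cost comes from.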
80. Describe ways to overcome scaling issues.
Nyström methods/low-rank kernel matrix approximations, random features, local by query/nearest neighbors
81. What are some tools for parallelizing machine learning algorithms?
GPUs, Matlab parfor, write your own using low-level primitives/RPC/MPI, MapReduce, Spark, Vowpal Wabbit, GraphLab, Giraph, Petuum, parameter server
82. In Python, do you have a favorite/least favorite PEP?
PEPs are Python Enhancement Proposals. If a candidate has a favorite or least favorite PEP, it indicates that they have knowledge of Python.
Tableau Interview Questions
83. What is Data Visualization?
Data visualization is an advanced, direct, precise and ordered way of viewing large volumes of data. It is the visual representation of data in the form of graphs and charts, especially useful when the data cannot be described textually. You can show trends, patterns and correlations through various data visualization software and tools; Tableau is one such data visualization software used by businesses and corporates.
84. What are the differences between Tableau Desktop and Tableau Server?
While Tableau Desktop performs data visualization and workbook creation, Tableau Server is used to distribute these interactive workbooks and/or reports to the right audience. Users can edit and update the workbooks and dashboards online on Server but cannot create new ones. However, the editing options are limited compared to Desktop.
Tableau Public is a free tool consisting of Desktop and Server components accessible to anyone.
85. Define parameters in Tableau and their working.
Tableau parameters are dynamic variables/values that replace the constant values in data calculations and filters. For instance, you can create a calculated field value returning true when the score is greater than 80, and otherwise false. Using parameters, one can replace the constant value of 80 and control it dynamically in the formula.
86. Differentiate between parameters and filters in Tableau.
The difference lies in the application. Parameters allow users to insert their own values, which can be integers, floats, dates or strings, and these values can be used in calculations. Filters, however, receive only the values users choose to ‘filter by’ from the list, and these cannot be used to perform calculations.
Users can dynamically change measures and dimensions with a parameter, but filters do not support this feature.
87. What are fact tables and dimension tables in Tableau?
• Facts are the numeric metrics or measurable quantities of the data, which can be analyzed by dimension tables. Facts are stored in a fact table that contains foreign keys referring uniquely to the associated dimension tables. The fact table stores data at the atomic level and thus allows a larger number of records to be inserted at one time. For instance, a Sales fact table can have a product key, customer key, promotion key and items sold, referring to a specific event.
• Dimensions are the descriptive attribute values that define multiple characteristics of each attribute. A dimension table, referenced by a product key from the fact table, can consist of product name, product type, size, color, description, etc.
88. What are Quick Filters in Tableau?
Global quick filters are a way to filter each worksheet on a dashboard, provided that each of them contains the dimension being filtered. They are very useful for worksheets using the same data source, although this sometimes proves to be a disadvantage and generates slow results. In such cases, parameters are more useful.
89. State limitations of parameters in Tableau.
Parameters offer only four ways to represent data on a dashboard (quick filters offer seven). Further, parameters do not allow multiple selections in a filter.
90. What is aggregation and disaggregation of data in Tableau?
Aggregation and disaggregation in Tableau are ways to develop a scatterplot to compare and measure data values. As the name suggests, aggregation is the calculated form of a set of values that returns a single numeric value. For instance, a measure with values 1, 3, 5, 7 returns 16 when aggregated with Sum, or 1 with Min. You can also set a default aggregation for any measure that is not user-defined. Tableau supports various default aggregations for a measure, like Sum, Average, Median, Count and others.
Disaggregating data refers to viewing each data source row, while analyzing data both independently and dependently.
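The arithmetic behind aggregation can be sketched in Python. This is plain Python rather than Tableau syntax, and the names values, aggregations and disaggregated are hypothetical.

```python
from statistics import mean, median

# Hypothetical measure values coming from four data source rows
values = [1, 3, 5, 7]

# Each aggregation collapses the rows into a single number
aggregations = {
    "sum": sum(values),
    "average": mean(values),
    "median": median(values),
    "count": len(values),
    "min": min(values),
}

# Disaggregated view: every row is kept as its own value
disaggregated = list(values)
print(aggregations)
```

Which single number comes back depends entirely on the aggregation chosen, which is why the default aggregation of a measure matters.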
91. What is Data Blending?
Unlike Data Joining, Data Blending in Tableau allows combining data from different sources and platforms. For instance, you can blend data present in an Excel file with that of an Oracle DB to create a new dataset.
92. What is a Context Filter?
The concept of a context filter in Tableau makes the process of filtering smooth and straightforward. It establishes a filtering hierarchy where all other filters refer to the context filter for their subsequent operations. The other filters then process only the data that has passed through the context filter.
Creating one or more context filters improves performance, as users do not have to create extra filters on a large data source, which reduces the query-execution time.
You can create one by dragging a field onto the ‘Filters’ shelf, then right-clicking that field and selecting ‘Add to Context’.
93. What are the limitations of context filters?
Tableau takes time to place a filter in context. When a filter is set as a context filter, the software creates a temporary table for that particular context filter. This table reloads each time and consists of all values that are not filtered out by either the context filter or a Custom SQL filter.
94. Name the file extensions in Tableau.
There are a number of file types and extensions in Tableau:
• Tableau Workbook (.twb)
• Tableau Packaged Workbook (.twbx)
• Tableau Datasource (.tds)
• Tableau Packaged Datasource (.tdsx)
• Tableau Data Extract (.tde)
• Tableau Bookmark (.tbm)
• Tableau Map Source (.tms)
• Tableau Preferences (.tps)
95. Explain the difference between .twb and .twbx
.twb is the most common file extension used in Tableau. It is an XML-format file that holds all the information about each dashboard and sheet, such as which fields are used in the views and the styles and formatting applied to a sheet or dashboard, but it does not contain any data. The packaged workbook (.twbx) merges the information in a Tableau workbook with the local data available (data that is not on the server). A .twbx file serves as a zip file, which also includes custom images, if any. A packaged workbook allows users to share their workbook with other Tableau Desktop users and lets them open it in Tableau Reader.
96. What are Extracts and Schedules in Tableau server?
Data extracts are copies or subsets of the actual data from the original data sources. Workbooks that use data extracts are faster than those that use live DB connections, since the extracted data is imported into the Tableau engine. After the data is extracted, users can publish the workbook, which also publishes the extracts to Tableau Server. However, the workbook and extracts won’t refresh unless users apply a scheduled refresh on the extract. Scheduled refreshes are scheduling tasks set on a data extract when publishing a workbook, so that the extract is refreshed automatically. This also removes the burden of republishing the workbook every time the underlying data gets updated.
97. Name the components of a Dashboard
• Horizontal – Horizontal layout containers allow the designer to group worksheets and dashboard components left to right across the page and edit the height of all elements at once.
• Vertical – Vertical containers allow the user to group worksheets and dashboard components top to bottom down the page and edit the width of all elements at once.
• Text
• Image Extract – A Tableau workbook is in XML format. To extract images, Tableau applies some code to extract an image that can be stored in XML.
• Web [URL ACTION] – A URL action is a hyperlink that points to a web page, file, or other web-based resource outside of Tableau. You can use URL actions to link to more information about your data that may be hosted outside of your data source. To make the link relevant to your data, you can substitute field values of a selection into the URL as parameters.
98. How to view underlying SQL Queries in Tableau?
There are two options for viewing the underlying SQL queries in Tableau:
• Create a Performance Recording to record performance information about the main events that occur as you interact with a workbook. Users can then view the performance metrics in a workbook created by Tableau.
  Help > Settings and Performance > Start Performance Recording
  Help > Settings and Performance > Stop Performance Recording
• Review the Tableau Desktop logs located at C:\Users\\My Documents\My Tableau Repository. For a live connection to a data source, check the log.txt and tabprotosrv.txt files. For an extract, check the tdeserver.txt file.

99. What is Page shelf?
Tableau provides a distinct and powerful tool to control the output display, known as the Page shelf. As the name suggests, the Page shelf breaks the view into a series of pages, presenting a different view on each page, which makes it more user-friendly and minimizes the scrolling needed to analyze and view data. You can flip through the pages using the specified controls and compare them on a common axis.
100. How to do Performance Testing in Tableau?
Performance testing is an important part of implementing Tableau. It can be done by load testing Tableau Server with TabJolt, a “Point and Run” load generator created to perform QA. TabJolt is not supported by Tableau directly and has to be installed using other open source products.
101. How many maximum tables can you join in Tableau?
A maximum of 32 tables can be joined in Tableau. A table must also be limited to 255 columns (fields).