Research Methods in Forest Sciences: Learning Diary
Yoko Lu 285122
9 December 2016

1. Research process

Science pursues and applies knowledge about both natural and social processes through systematic thinking based on evidence derived from observation and experiment. This tradition goes back to Aristotle, often regarded as a founder of science. The first step is to observe a problem from a scientific perspective. A hypothesis is then formulated as a tentative explanation consistent with the observation. Predictions are derived from the hypothesis, and experiments are designed to test those predictions; if the results contradict the predictions, the hypothesis is modified. If no clear conclusion can be drawn from comparing the hypothesis with the observations, new predictions are derived and the hypothesis is revised as needed. When the experimental results consistently support the hypothesis, the results are repeatable, and the theory nevertheless remains open to falsification. The stages of the research process are: (1) problems, (2) theories, (3) criticism, and (4) new problems.

Statistics is an important tool in research: statistical results show whether findings are significant and what the numerical values mean. Statistics can also reveal historical trends in the subject, and statistical analysis often provides the key answers to the important research questions. Commonly used software packages for statistical analysis include Excel, SPSS, and R, to name a few. While all of these tools support statistical analysis, it is important to consider when and why one package is used rather than another, because each has its own pros and cons, such as how user friendly it is or whether it offers all the required analysis tools.
Sampling is one example of a basic statistical idea. Sampling allows conclusions about a whole population when not every member of the population can be measured; this is the concept of inference. For example, when we conduct observations and experiments on soil composition and its ecosystem, we cannot obtain all the soil and its components. Therefore, we take soil samples and experiment on or observe them, for example in a laboratory setting.

2. Basic concepts in statistics

Basic concepts in statistics include the mean, variance, standard deviation, error, distributions, and degrees of freedom:

Mean: the average of the values, i.e. the total of the values divided by the number of samples.
Variance: the average squared deviation of the values from the mean; a measure of dispersion.
Standard deviation: the square root of the variance; it describes how far the values are spread from the mean.
Error: a concept of variation; the difference between a statistical estimate and the real-world value.
Distribution: how probabilities are allocated over events (a probability distribution). The normal distribution is symmetric about the mean, and for it the mean, mode, and median coincide. The standard deviation marks the inflection points of the curve on either side of the mean. Examples include standardized test scores in one class and physiological characteristics such as those used in pharmacy.
Degrees of freedom: the number of independent values that are free to vary in a calculation, e.g. n − 1 when the sample variance is computed from n observations.

When it comes to sampling, we usually have no data on the population parameters (i.e. its mean and standard deviation). Therefore, when we take a sample from the whole population, we perform statistical analysis to obtain the sample mean and standard deviation. If we take multiple samples from the same population, we observe different values of the mean and standard deviation from sample to sample. Here the normal distribution is useful when we want to estimate the mean of the whole population. A normal distribution always contains some variation, which determines how wide or narrow the curve is.
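The basic concepts above can be sketched in a few lines of code. The example below uses Python's standard library as a stand-in for the tools mentioned earlier (Excel, SPSS, R); the sample values are invented for illustration:

```python
import math
import statistics

# Hypothetical sample: diameters (cm) of five measured trees
sample = [20, 24, 22, 30, 24]

mean = statistics.mean(sample)          # total of values divided by number of samples
variance = statistics.variance(sample)  # average squared deviation; divisor is n - 1,
                                        # the degrees of freedom of the estimate
stdev = statistics.stdev(sample)        # square root of the variance

print(mean, variance, round(stdev, 2))
```

Taking several such samples from the same population would give slightly different means and standard deviations each time, which is exactly the sampling variation described above.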
3. T-test and ANOVA

It is important to formulate hypotheses so that the experiment can test whether the aim of the study holds. The null hypothesis (H0) is the default statement, assumed true until the data suggest otherwise, while the alternative hypothesis (H1) is the statistical statement the experiment is designed to detect. The null hypothesis is therefore the opposite of the alternative hypothesis. When applying these hypotheses to an experiment, it is important to check the assumptions: normality, sample size, and whether the samples were randomly selected.

Before a t-test, the equality of variances is tested, for example with Levene's test or Fisher's F-test: there H0 states that the variances are equal and H1 that they are not. The t-test itself then compares means: H0 states that the means are equal and H1 that they differ. The p-value is the key quantity when judging the significance of the test. When p > 0.1, the data support H0; when 0.05 < p ≤ 0.1, they still weakly support H0. When 0.01 < p ≤ 0.05, H0 is rejected and H1 is accepted: the result is statistically significant. When 0.001 < p ≤ 0.01, the conclusion is the same but the result is more significant. Finally, when p ≤ 0.001, H1 is accepted and the result is very highly significant.

There are three types of t-test: the one-sample t-test, the independent-samples t-test, and the paired-samples t-test. The one-sample t-test asks whether the mean of a single variable differs from a specified value. The independent-samples t-test compares two groups of cases.
It is best when the test subjects are randomly assigned to the two groups. In the paired-samples t-test, the means of two variables are compared within a single group.
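The independent-samples t-test described above can be sketched by hand with the standard library alone (the group measurements are made up; the pooled-variance form assumes the variances were first found equal, e.g. with Levene's test):

```python
import math
import statistics

# Hypothetical growth measurements (cm) for two randomly assigned groups
group_a = [12.0, 14.0, 11.0, 13.0, 15.0]
group_b = [16.0, 18.0, 17.0, 15.0, 19.0]

n_a, n_b = len(group_a), len(group_b)
mean_a, mean_b = statistics.mean(group_a), statistics.mean(group_b)
var_a, var_b = statistics.variance(group_a), statistics.variance(group_b)

# Pooled variance: valid under the equal-variances assumption
pooled = ((n_a - 1) * var_a + (n_b - 1) * var_b) / (n_a + n_b - 2)

# t statistic for H0: the two group means are equal
t = (mean_a - mean_b) / math.sqrt(pooled * (1 / n_a + 1 / n_b))
df = n_a + n_b - 2  # degrees of freedom

print(t, df)  # -4.0 8
```

In practice SPSS or R would also report the p-value for this t and df, which is then compared against the significance thresholds listed above.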
When we compare three or more samples, we use a one-way ANOVA. It compares the means of replicated experiments in which one input factor is varied across different settings (levels). The aim is to find out what proportion of the variability is due to the different factors. When H0 is rejected, the variation in the output differs between levels and is not due to random error alone. A significant result also means it is important to examine which levels actually differ from each other.

4. Basics of modeling: simple regression

It is important to consider parsimony and simplicity when constructing a model. Parsimony means that variables and parameters should not be included in the model unless they are needed. It is ideal to keep the model as simple as possible, since unnecessary factors only make it more complex and create more problems. A linear regression model, as the name suggests, describes the relationship between variables (a dependent and one or more independent variables) as a straight line. Simple linear regression has only one explanatory variable, while multiple linear regression has more than one. The linear regression model assumes that the errors are normally distributed with mean 0, that the error variance is constant and independent of the variable values, and that the errors are independent of the variable values and of each other. To determine the strength of the relationship between the model and the dependent variable, R2 is used (R is the Pearson correlation coefficient). The ANOVA table can also be used to determine whether the model explains differences in the dependent variable, although this is only an indirect way to assess the relationship strength.
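The one-way ANOVA described above comes down to splitting the total variation into a between-level and a within-level part. A minimal sketch with invented data for three levels of one factor:

```python
import statistics

# Hypothetical yields at three levels of a single input factor
levels = {
    "low":    [10, 12, 11],
    "medium": [14, 16, 15],
    "high":   [20, 18, 19],
}

all_values = [v for group in levels.values() for v in group]
grand_mean = statistics.mean(all_values)

# Variation between levels vs. variation within levels
ss_between = sum(len(g) * (statistics.mean(g) - grand_mean) ** 2
                 for g in levels.values())
ss_within = sum(sum((v - statistics.mean(g)) ** 2 for v in g)
                for g in levels.values())

df_between = len(levels) - 1               # k - 1
df_within = len(all_values) - len(levels)  # n - k

# F statistic: a large value means the levels differ beyond random error
f_stat = (ss_between / df_between) / (ss_within / df_within)
print(f_stat)
```

Comparing f_stat against the F distribution with (df_between, df_within) degrees of freedom gives the p-value that decides whether H0 is rejected.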
When the regression and residual sums of squares are almost identical, as in an example analyzed with RStudio, only about half of the variation is explained by the model, as seen from its ANOVA table.
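The relationship between the fitted line, the sums of squares, and R2 can be shown in a short least-squares sketch (the data points are invented):

```python
# Least-squares fit of a simple linear regression y = b0 + b1*x,
# with R^2 derived from the residual and total sums of squares.
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 6]

n = len(x)
mean_x = sum(x) / n
mean_y = sum(y) / n

# Slope and intercept from the normal equations
b1 = (sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
      / sum((xi - mean_x) ** 2 for xi in x))
b0 = mean_y - b1 * mean_x

fitted = [b0 + b1 * xi for xi in x]

ss_total = sum((yi - mean_y) ** 2 for yi in y)               # total variation
ss_resid = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))  # unexplained variation
r_squared = 1 - ss_resid / ss_total  # share of variation explained by the model

print(round(b1, 2), round(b0, 2), round(r_squared, 3))
```

Here the regression sum of squares (ss_total − ss_resid) is clearly larger than the residual sum of squares, so the model explains most of the variation; when the two are nearly equal, R2 is close to 0.5, as noted above.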
Representativeness (sample size, population), normality (the model errors should follow a normal distribution), independence of errors (data randomly collected), and homoscedasticity (equal variance across the data set) are the key assumptions to consider when creating a simple regression model.

5. Advanced models: alternatives to simple regression

It is important to choose an appropriate model based on the purpose of the study and the information we are looking for. Simple regression assumes constant variance, normal errors, and independent errors in the behavior of the response variable. The response variable (continuous measurements such as weight or height; other data types include counts and proportions) can be transformed, even when there are one or more explanatory variables. Transformation means that the formula Y = B0 + B1X is altered: for example, suitable functions of the response or the explanatory variables are taken so that the model becomes linear. It is also worth considering substitutes for transformation so that models can be built effectively, and it is crucial to note that not all response variables satisfy the assumptions of constant variance and normal errors. Besides transformations, alternatives to the simple regression model include fitting a different linear model or a non-linear model, to name a few. For the model to represent the results needed for the analysis more effectively and efficiently, it may be wise to remove outliers, find the best model among those including extra variables, and use methods for dealing with multicollinearity.

6. Validation of models

Models need to be validated to ensure that they are correct and that the statistical results are not biased or misleading. Once a model has been created, we need to ask whether it works outside the study area. For example, if the model fits one subject in one region best, will it also fit well in another region? Here R is one tool that can be used to validate the model. After fitting the model to the modeling data set, it is necessary to check the model results to see whether the model is applicable to that data set and whether it makes sense; R2, p-values, and residuals should all be checked. RMSE (root mean squared error) and bias are two measures commonly used in model validation. RMSE describes how widely the measured and modeled values are scattered, which refers to the precision of the model: RMSE = sqrt(Σ(yi − ŷi)² / n). Bias describes how the average level of the modeled values differs from the measured values, which refers to the accuracy of the model: bias = Σ(yi − ŷi) / n. Both can be expressed as absolute values or, when divided by the mean and multiplied by 100, as relative (percentage) values.
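RMSE and bias can be sketched directly from their definitions. The measured and modeled values below are invented, and the sign convention for bias (measured minus modeled) is one common choice, not the only one:

```python
import math

# Hypothetical measured (observed) and modeled (predicted) values
measured = [10.0, 12.0, 14.0, 16.0, 18.0]
modeled  = [11.0, 11.0, 15.0, 15.0, 19.0]

n = len(measured)
mean_measured = sum(measured) / n

# Absolute RMSE (precision) and bias (accuracy);
# bias is taken here as measured minus modeled
rmse = math.sqrt(sum((m - p) ** 2 for m, p in zip(measured, modeled)) / n)
bias = sum(m - p for m, p in zip(measured, modeled)) / n

# Relative versions, as a percentage of the measured mean
rmse_pct = 100 * rmse / mean_measured
bias_pct = 100 * bias / mean_measured

print(rmse, bias, round(rmse_pct, 1), round(bias_pct, 1))
```

A small RMSE with a clearly non-zero bias would indicate a precise but systematically shifted model, which is exactly the precision/accuracy distinction made above.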
7. Presentation and interpretation of scientific results

Note: this section is closely related to section 1 (Research process), so I am going to answer the question from a different angle: from the perspective of R. R is one way to present and interpret scientific results through code. Basic statistical diagrams such as scatter plots, box plots, and histograms can be created in R. As shown in the lecture notes, the ggplot2 package (http://r4stats.com/examples/graphics-ggplot2/) produces graphs with colors, which makes them much easier to interpret, especially when there are many plots. The range of the y-axis can also be modified, for example; fit lines can be plotted, and axis labels and ranges can be added to the plot. With ggplot2, bar plots and pie charts can also be made, and it is possible to visualize more than one distribution in the same graph, as illustrated in the course notes. Besides statistical graphics (i.e. plots and charts), when the analysis includes spatial and temporal data, the data can be analyzed and interpreted in a GIS tool. Raster data can be processed with different tools within the GIS software, such as resampling of resolution and reclassification. Another tool is ggmap, with which maps can be created by adding plots onto existing maps, as shown in the course material.
8. Qualitative research & surveys

It is important to draw conclusions from observations and to survey different study subjects, basing the research on these. Count data is one example of data that can be part of qualitative research. Counts are based on frequencies: events are counted to see how many times they have occurred. The number of dead trees after a severe storm, the number of people visiting a website per day, and the number of microorganisms on a leaf are all examples of count data. In statistical analysis, such a count can serve as the response variable, for example when asking how the occurrence of dead trees in one area relates to dead trees in other areas. Proportional data is another type of data treated here as qualitative. Unlike with count data, we look at two or more variables: in the case of the number of dead trees after a storm, we include the surviving trees as well. Other examples include infection rates of different diseases and emissions of different gases. In statistical analysis, the percentage is used as the response variable.

9. GIS tools (and Remote Sensing)

When applying GIS in combination with statistical analysis and models, the data need to be collected first, as mentioned in the previous section on qualitative research and surveys. GIS is an important tool for decision making. Monitoring, field surveys, and remote sensing are all key sources of data for GIS.
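The difference between count data and proportional data can be made concrete with a small sketch (the plot counts are invented):

```python
# Count data: number of dead trees found on each sample plot after a storm
dead = [5, 2, 0, 8, 3]

# Proportional data: include the surviving trees as well
# and use a percentage as the response variable
alive = [45, 38, 50, 32, 47]

dead_pct = [100 * d / (d + a) for d, a in zip(dead, alive)]

total_trees = sum(dead) + sum(alive)
overall_pct = 100 * sum(dead) / total_trees

print(dead_pct[0], round(overall_pct, 1))
```

The raw counts in `dead` would be modeled as count data, while `dead_pct` turns the same observations into proportional data by bringing the surviving trees into the denominator.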
When looking at maps and models created with GIS, there are three model types: descriptive, predictive, and prescriptive. A descriptive model looks for patterns, processes, and spatial/temporal interactions, e.g. the distribution of insects. A predictive model, as the name suggests, provides simulations, as well as stochastic models and scenario analyses, e.g. global temperature change. A prescriptive model asks what the best course of action is, also known as optimization, e.g. the best option for preventing the spread of insect diseases and pests.

One example is a study of snow and wind damage. Sample plots from past years are collected and input into ArcGIS for interpretation of the data. The numbers of trees affected and unaffected by snow and wind are included in the analysis, along with wind/snow data, forest type, and stand-level management variables (e.g. basal area, diameter, and height). The frequency and variation of damage over time, its spatial patterns, and the average damage to trees are studied. Using all of these variables, the spatial patterns of wind and snow damage can be mapped in categories of numerical values (e.g. high/low); keeping the number of categories small makes the model easier to visualize and avoids confusion. When the numbers need further analysis, the attribute table within ArcGIS can be examined, for example with its summary functions, or the data can be exported for use in other software such as Excel. In general, depending on what the study needs to analyze, the data can be vector or raster. Raster data usually come from remote sensing sources such as aerial photos, and variables that are not purely horizontal, such as wind and snow data, are usually rasterized (along with elevation, topography, etc., if these need to be included in the model).