Estimating a Gravity Model
This tutorial demonstrates a basic gravity analysis including loading data, constructing some summary statistics, estimating a model, and outputting the results in several possible formats. For more information, see the list of commands and API reference in this documentation.
Load Data
The gme package uses a special object, called gme.EstimationData, to store data. An EstimationData includes a Pandas.DataFrame containing the data used for estimation (trade values, gravity variables, etc.) as well as additional information about that data and methods that can be used to summarize and/or manipulate the data. This tutorial will demonstrate some of these features.
First, we must begin by creating a gme.EstimationData. Doing so requires supplying a Pandas.DataFrame and several pieces of "meta-data" that describe the data. Start by loading a dataset using the read_csv() function from pandas. In the sample code below, we will read a dataset directly from the internet, but you could just as easily read the same file from your hard drive.
>>> import gme as gme
>>> import pandas as pd
>>> gravity_data = pd.read_csv('https://www.usitc.gov/data/gravity/example_trade_and_grav_data_small.csv')
>>> gravity_data.head()
  importer exporter  year   trade_value  agree_pta  common_language  \
0      AUS      BRA  1989  3.035469e+08        0.0              1.0
1      AUS      CAN  1989  8.769946e+08        0.0              1.0
2      AUS      CHE  1989  4.005245e+08        0.0              1.0
3      AUS      DEU  1989  2.468977e+09        0.0              0.0
4      AUS      DNK  1989  1.763072e+08        0.0              1.0

   contiguity  log_distance
0         0.0      9.553332
1         0.0      9.637676
2         0.0      9.687557
3         0.0      9.675007
4         0.0      9.657311

# Next, we use the loaded data to create an EstimationData instance called gme_data
>>> gme_data = gme.EstimationData(data_frame=gravity_data,
...                               imp_var_name='importer',
...                               exp_var_name='exporter',
...                               trade_var_name='trade_value',
...                               year_var_name='year',
...                               notes='Downloaded from https://www.usitc.gov/data/gravity/example_trade_and_grav_data_small.csv')
In creating an instance of the EstimationData object, the user supplies a Pandas.DataFrame, which we loaded in the previous lines, along with several types of descriptive arguments. These arguments (imp_var_name, exp_var_name, trade_var_name, and year_var_name) specify the columns in the supplied DataFrame corresponding to particular types of information that will likely be present in any gravity analysis (the columns containing the importer ID, exporter ID, trade values, and year, respectively). These "meta-data" fields can be useful because they prevent users from having to re-enter these same basic characteristics of the data at later points and permit the automatic construction of certain types of summary information. Finally, an optional note is supplied to the EstimationData. The EstimationData object has an attribute that stores a list of user-supplied strings for later reference. In this case, we have supplied a note indicating where the data originated.
Working with EstimationData
In addition to providing an object class that communicates conveniently with the gme.EstimationModel (see below), the EstimationData provides a collection of data summary and manipulation tools. For example, simply calling (or printing) the object returns a summary of the scope of the data:
>>> gme_data
number of countries: 62
number of exporters: 62
number of importers: 62
number of years: 27
number of sectors: not_applicable
dimensions: (98612, 8)
Other summary information can be reported in the following ways:
# Return the number of importers in the dataset.
>>> gme_data.number_of_importers
62

# Return a list of the column names
>>> gme_data.columns
['importer', 'exporter', 'year', 'trade_value', 'agree_pta', 'common_language', 'contiguity', 'log_distance']

# Return a list of years in the dataset
>>> gme_data.year_list()
[1989, 1990, 1991, 1992, 1993, 1994, 1995, 1996, 1997, 1998, 1999, 2000, 2001, 2002, 2003, 2004, 2005, 2006, 2007, 2008, 2009, 2010, 2011, 2012, 2013, 2014, 2015]

# Return a dictionary containing a list of countries in the dataset for each year.
>>> country_list = gme_data.countries_each_year()
>>> country_list[1989]
['IRN', 'BOL', 'TUR', 'ARG', 'CHL', 'HUN', 'KEN', 'VEN', 'ZAF', 'URY', 'BRA', 'DZA', 'PER', 'IRL', 'DNK', 'GHA', 'KOR', 'PAK', 'COL', 'IND', 'ISL', 'ISR', 'ESP', 'ITA', 'NLD', 'NGA', 'AUS', 'SWE', 'PRY', 'GBR', 'IDN', 'HKG', 'NOR', 'TUN', 'EGY', 'KWT', 'DEU', 'CHE', 'MYS', 'NZL', 'LBY', 'USA', 'SDN', 'CHN', 'GRC', 'MEX', 'CAN', 'PRT', 'SAU', 'POL', 'PHL', 'THA', 'FRA', 'JPN', 'MAR', 'AUT', 'FIN', 'SGP', 'ECU']

# Additionally, many of the descriptive methods from Pandas.DataFrames have been inherited:
>>> gme_data.dtypes()
importer            object
exporter            object
year                 int64
trade_value        float64
agree_pta          float64
common_language    float64
contiguity         float64
log_distance       float64
dtype: object

>>> gme_data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 98612 entries, 0 to 98611
Data columns (total 8 columns):
importer           98612 non-null object
exporter           98612 non-null object
year               98612 non-null int64
trade_value        98612 non-null float64
agree_pta          97676 non-null float64
common_language    97676 non-null float64
contiguity         97676 non-null float64
log_distance       97676 non-null float64
dtypes: float64(5), int64(1), object(2)
memory usage: 6.0+ MB

>>> gme_data.describe()
               year   trade_value     agree_pta  common_language  \
count  98612.000000  9.861200e+04  97676.000000     97676.000000
mean    2002.210441  1.856316e+09      0.381547         0.380646
std        7.713050  1.004735e+10      0.485769         0.485548
min     1989.000000  0.000000e+00      0.000000         0.000000
25%     1996.000000  1.084703e+06      0.000000         0.000000
50%     2002.000000  6.597395e+07      0.000000         0.000000
75%     2009.000000  6.125036e+08      1.000000         1.000000
max     2015.000000  4.977686e+11      1.000000         1.000000

         contiguity  log_distance
count  97676.000000  97676.000000
mean       0.034051      8.722631
std        0.181362      0.818818
min        0.000000      5.061335
25%        0.000000      8.222970
50%        0.000000      9.012502
75%        0.000000      9.303026
max        1.000000      9.890765
Additionally, the EstimationData object retains the full ability to work with the supplied DataFrame, which can be easily accessed via the data_frame attribute of the EstimationData.
# Return the column of trade_values
>>> gme_data.data_frame['trade_value']
0        3.035469e+08
1        8.769946e+08
2        4.005245e+08
             ...
98609    0.000000e+00
98610    0.000000e+00
98611    0.000000e+00
Name: trade_value, Length: 98612, dtype: float64
Finally, the EstimationData object features a tool for easy aggregation and custom summary information. Because the method used for this process returns a DataFrame, the aggregated information can itself be used for many other applications, including the creation of a new EstimationData object.
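The aggregation method itself is documented in the API reference. To illustrate the underlying idea, the sketch below performs a comparable aggregation directly on the data_frame attribute using plain pandas; the mini dataset is invented purely for illustration.

```python
import pandas as pd

# Hypothetical mini dataset with the same columns as the tutorial data.
df = pd.DataFrame({
    'importer': ['AUS', 'AUS', 'BRA', 'BRA'],
    'exporter': ['BRA', 'CAN', 'AUS', 'CAN'],
    'year': [1989, 1989, 1990, 1990],
    'trade_value': [100.0, 200.0, 50.0, 75.0],
})

# Aggregate total trade by importer and year. The result is itself a
# DataFrame, so it could be fed back into a new gme.EstimationData.
totals = df.groupby(['importer', 'year'], as_index=False)['trade_value'].sum()
print(totals)
```

In the real workflow, `df` would be replaced by `gme_data.data_frame`.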
Note
Knowing when to end a command with ( ): When first learning Python, it can be confusing to determine when a command applied to an object should be followed by parentheses. In the preceding code example, you will see instances of both: gme_data.columns and gme_data.year_list( ), for example. Knowing which format to use is largely a matter of becoming familiar with the functions you are using. However, there is a logic to it. Each time a command is applied to an object (i.e. using the syntax object.command), you are calling either an attribute of the object or a method on the object. An attribute is a piece of information that has already been created and stored in the object, whereas a method is effectively a function that can be run on the object. A method is called with parentheses because it may accept additional arguments. For example, this is the case with the DataFrame method head( ), which can accept a user-provided number of rows. However, you will often find that you do not need to supply additional arguments to a method, in which case you leave the parentheses empty. An attribute, by comparison, is not followed by ( ) because there is no need or ability to provide additional input; the contents of the attribute have already been computed. Ultimately, knowing whether a command is an attribute or a method simply requires familiarity with the object.
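To make the distinction concrete, here is a small example using a plain pandas DataFrame (rather than gme objects, purely for demonstration):

```python
import pandas as pd

df = pd.DataFrame({'a': [1, 2, 3]})

# Attribute: precomputed information, accessed without parentheses.
cols = df.columns        # an Index of column names

# Method: a function run on the object, called with parentheses.
first_rows = df.head(2)  # accepts an optional argument (number of rows)
also_first = df.head()   # parentheses left empty, so the default is used

print(list(cols))
```

Attempting `df.head` without parentheses would return the method object itself rather than any rows, which is usually not what you want.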
Creating and estimating a model
Once an EstimationData has been created, estimating a gravity model using the data is fairly straightforward. There are two basic steps for estimation: (1) define a model and (2) estimate the model.
Defining the model amounts to creating another object called EstimationModel. Like the EstimationData, EstimationModel is meant to standardize and simplify the steps typically taken to specify and estimate a gravity model. While the EstimationData is meant to be an object that is created once for each study, many EstimationModels will likely be defined and redefined as you test different specifications. Thus, the arguments and attributes of the EstimationModel reflect the different types of modifications you may want to make as you select your preferred specification.
As with the EstimationData, the EstimationModel is largely a dataset with added information that defines the characteristics of the particular model. The following examples depict several model specifications, each demonstrating different types of model aspects that can be specified.
# A simple case in which 'trade_value' is dependent on 'log_distance', 'agree_pta', etc.
>>> model_baseline = gme.EstimationModel(estimation_data = gme_data,
...                                      lhs_var = 'trade_value',
...                                      rhs_var = ['log_distance', 'agree_pta',
...                                                 'common_language', 'contiguity'])

# A specification that will generate and include importer-year and exporter-year
# fixed effects
>>> fixed_effects_model = gme.EstimationModel(estimation_data = gme_data,
...                                           lhs_var = 'trade_value',
...                                           rhs_var = ['log_distance', 'agree_pta',
...                                                      'common_language', 'contiguity'],
...                                           fixed_effects=[['importer','year'],
...                                                          ['exporter','year']])

# A specification that uses a subset of the data. The United States ('USA') will be omitted
# and only the years 2013--2015 will be included.
>>> data_subset_model = gme.EstimationModel(estimation_data = gme_data,
...                                         lhs_var = 'trade_value',
...                                         rhs_var = ['log_distance', 'agree_pta',
...                                                    'common_language', 'contiguity'],
...                                         drop_imp_exp=['USA'],
...                                         keep_years=[2015, 2014, 2013])
- Model Variables: The variables to be included are specified using the arguments lhs_var and rhs_var, which denote the left-hand-side dependent variable and right-hand-side independent variables, respectively.
- Fixed Effects: The model, at the point at which it is estimated, will construct fixed effects if any are specified by fixed_effects. These can be either single variables (e.g. ['importer']), or interacted variables (e.g. [['importer', 'year']]). For example, entering [ 'importer', ['exporter', 'year']] would yield a set of importer fixed effects and a set of exporter-year fixed effects.
- Data Subsets: Subsets of the data to use for estimation can be specified in a variety of ways. The arguments keep_years and drop_years can be used to select only a subset of years to include. Similarly, the keep_imp, keep_exp, and keep_imp_exp arguments, and their corresponding drop_... options, can do the same for importers and/or exporters.
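gme constructs the fixed effects internally at estimation time. Purely as an illustration of what an interacted fixed effect is, the following sketch builds importer-year dummy columns by hand with pandas; the column naming mirrors the importer_year_fe_ARG2013-style names that appear in the estimation output later in this tutorial, and the mini dataset is invented.

```python
import pandas as pd

# Hypothetical mini dataset of importer/year pairs.
df = pd.DataFrame({
    'importer': ['AUS', 'AUS', 'BRA'],
    'year': [2013, 2014, 2013],
})

# An interacted importer-year fixed effect is simply a dummy variable for
# each importer/year combination, e.g. 'importer_year_fe_AUS2013'.
fe = pd.get_dummies(
    df['importer'] + df['year'].astype(str),
    prefix='importer_year_fe',
)
print(list(fe.columns))
```

A single (non-interacted) fixed effect such as ['importer'] would instead create one dummy per importer.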
Once a model has been defined, running a PPML estimation according to the supplied specification is quite straightforward. It only requires the application of a single method of the EstimationModel: .estimate(). No further inputs are required.
# Define a new, fixed effects model using only a subset of years (to reduce the computation time)
>>> fixed_effects_model_2 = gme.EstimationModel(estimation_data = gme_data,
...                                             lhs_var = 'trade_value',
...                                             rhs_var = ['log_distance', 'agree_pta',
...                                                        'common_language', 'contiguity'],
...                                             fixed_effects=[['importer','year'],
...                                                            ['exporter','year']],
...                                             keep_years = [2013,2014,2015])

# Conduct a PPML estimation of the fixed effects model.
>>> estimates = fixed_effects_model_2.estimate()
select specification variables: ['log_distance', 'agree_pta', 'common_language', 'contiguity', 'trade_value', 'importer', 'exporter', 'year'], Observations excluded by user: {'rows': 0, 'columns': 0}
drop_intratrade: no, Observations excluded by user: {'rows': 0, 'columns': 0}
drop_imp: none, Observations excluded by user: {'rows': 0, 'columns': 0}
drop_exp: none, Observations excluded by user: {'rows': 0, 'columns': 0}
keep_imp: all available, Observations excluded by user: {'rows': 0, 'columns': 0}
keep_exp: all available, Observations excluded by user: {'rows': 0, 'columns': 0}
drop_years: none, Observations excluded by user: {'rows': 0, 'columns': 0}
keep_years: [2013, 2014, 2015], Observations excluded by user: {'rows': 87632, 'columns': 0}
drop_missing: yes, Observations excluded by user: {'rows': 0, 'columns': 0}
Estimation began at 08:58 AM on Jun 19, 2018
Estimation completed at 08:58 AM on Jun 19, 2018
Viewing, formatting, and outputting the results
The first step to viewing and working with the regression estimates is unpacking them from the dictionary in which they have been stored. A dictionary is an object in which each stored item is associated with a key, and that key can be used to return its associated item. In the above example, estimates is a dictionary in which each item is a results object. In our example, there is only one such object because only one regression was run. In cases in which multiple regressions are run because multiple sectors are estimated separately, the dictionary would contain multiple results objects, each keyed with the name of the respective sector.
# Return a list of keys in the object
>>> estimates.keys()
dict_keys(['all'])

# Return the result object and save it to a new variable for convenience
>>> results = estimates['all']
The estimation uses tools from the statsmodels package, so the results inherit all of the features of the statsmodels GLM results object [3]. This means that the object contains a plethora of fields reflecting things like coefficient estimates, standard errors, p-values, AIC/BIC, etc. Similarly, there is a useful method associated with the object that can be used for creating summary tables.
# print a summary of the results
>>> results.summary()
<class 'statsmodels.iolib.summary.Summary'>
                 Generalized Linear Model Regression Results
==============================================================================
Dep. Variable:            trade_value   No. Observations:                 8700
Model:                            GLM   Df Residuals:                     8371
Model Family:                 Poisson   Df Model:                          328
Link Function:                    log   Scale:                             1.0
Method:                          IRLS   Log-Likelihood:            -4.8282e+12
Date:                Wed, 20 Jun 2018   Deviance:                   9.6565e+12
Time:                        13:36:10   Pearson chi2:                 1.22e+13
No. Iterations:                    10
============================================================================================
                               coef    std err          z      P>|z|      [0.025      0.975]
--------------------------------------------------------------------------------------------
log_distance                -0.7398      0.024    -30.982      0.000      -0.787      -0.693
agree_pta                    0.3342      0.043      7.824      0.000       0.250       0.418
common_language              0.1288      0.039      3.270      0.001       0.052       0.206
contiguity                   0.2552      0.047      5.423      0.000       0.163       0.347
importer_year_fe_ARG2013    26.9804      0.361     74.690      0.000      26.272      27.688
importer_year_fe_ARG2014    26.8032      0.344     77.840      0.000      26.128      27.478
importer_year_fe_AUS2013    28.1690      0.315     89.455      0.000      27.552      28.786
... (truncated for this tutorial)
============================================================================================

# Extract the estimated parameter values (returned as a Pandas.Series)
>>> coefficients = results.params
>>> coefficients.head()
log_distance                -0.739840
agree_pta                    0.334219
common_language              0.128770
contiguity                   0.255161
importer_year_fe_ARG2013    26.980367
dtype: float64

# Extract the standard errors
>>> results.bse
log_distance                0.023879
agree_pta                   0.042720
...
exporter_year_fe_VEN2015    0.346733
Length: 329, dtype: float64

# Extract the p-values
>>> results.pvalues
log_distance                9.318804e-211
agree_pta                    5.134355e-15
...
exporter_year_fe_VEN2015     5.681631e-03
Length: 329, dtype: float64

# Return fitted values
>>> results.fittedvalues
0       1.610136e+09
1       3.044133e+08
2       5.799368e+08
...
9359    1.329831e+10
Length: 8700, dtype: float64
# Return diagnostic information (a Pandas.Series or DataFrame)
>>> fixed_effects_model_2.ppml_diagnostics
Overfit Warning                                                               No
Collinearities                                                               Yes
Number of Columns Excluded                                                    41
Perfectly Collinear Variables    [exporter_year_fe_ZAF2013, exporter_year_fe_ZA...
Zero Trade Variables             [importer_year_fe_ARG2015, importer_year_fe_AU...
Completion Time                                                     0.25 minutes
dtype: object

# Retrieve the full list of collinear columns
>>> fixed_effects_model_2.ppml_diagnostics['Perfectly Collinear Variables']
['exporter_year_fe_ZAF2013', 'exporter_year_fe_ZAF2014', 'exporter_year_fe_ZAF2015']
The gme package also features several tools to help compile and format the results for use within Python or outside of it. These tools include a method named combine_sector_results() for pulling all coefficients, standard errors, and p-values from multiple sectors into a single DataFrame, and a second, called format_regression_table(), that creates formatted tables for presentation, which can be exported as a text file, csv file, or LaTeX file with some stylized syntax.
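To give a rough sense of the shape of the combined output, the sketch below assembles a comparable DataFrame by hand from the results fields shown earlier (results.params, results.bse, results.pvalues). The numbers are mock stand-ins copied from the tutorial output, and this is an illustration of the idea, not the actual implementation of combine_sector_results().

```python
import pandas as pd

# Mock stand-ins for results.params, results.bse, and results.pvalues;
# in practice these come from the statsmodels results object.
params = pd.Series({'log_distance': -0.7398, 'agree_pta': 0.3342})
bse = pd.Series({'log_distance': 0.0239, 'agree_pta': 0.0427})
pvalues = pd.Series({'log_distance': 0.0001, 'agree_pta': 0.0001})

# One block of columns per sector; here the single sector is keyed 'all'.
# With multiple sectors, repeat with a sector-specific prefix and
# concatenate along the columns.
combined = pd.DataFrame({
    'all_coeff': params,
    'all_stderr': bse,
    'all_pvalue': pvalues,
})
print(combined)
```

The resulting DataFrame, with one row per variable and one column per statistic, is convenient to export or merge with other results.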
Tip
LaTeX users: The method format_regression_table() can output a table into a csv file with some desired LaTeX syntax. This can be done by specifying format = '.csv' and latex_syntax = True. This option allows users to manipulate the tables in spreadsheet software while retaining LaTeX syntax. We recommend reading the file into the spreadsheet software as 'text data' so that the software does not try to interpret the formatting of the table entries.
The gme package also has several ways to save the results to a file, either in full or in a space-saving slim version. See the API Reference portion of this documentation for further information.
[1] Had there been multiple sectors, we could have indicated so by adding the input sector_var_name = 'sector_column' in the declaration of the EstimationData.

[2] Additionally, the results of the estimation are saved as an attribute of the estimation model---EstimationModel.results_dict---and can be retrieved that way as well.

[3] For more details about the statsmodels results object, see http://www.statsmodels.org/0.6.1/generated/statsmodels.genmod.generalized_linear_model.GLMResults.html.