
Supporting Tasks

common.elasticsearch_load

Load records into elasticsearch clusters.

class edx.analytics.tasks.common.elasticsearch_load.ElasticsearchIndexTask(*args, **kwargs)

Index a stream of documents in an elasticsearch index.

This task is intended to do the following:

  • Create a new index that is unique to this task run (all significant parameters).
  • Load all of the documents into this unique index.
  • If the alias is already pointing at one or more indexes, switch it so that it only points at this newly loaded index.
  • Delete any indexes that were previously pointed at by the alias, leaving only the newly loaded index.
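
The load-then-swap procedure described here can be sketched with an in-memory stand-in for the cluster. This is a minimal illustration under stated assumptions, not the task's implementation: `FakeCluster`, `load_and_swap`, and the `alias_runid` index naming are hypothetical, and a real cluster would be driven through an elasticsearch client rather than plain dictionaries.

```python
class FakeCluster:
    """Hypothetical in-memory stand-in for an elasticsearch cluster."""

    def __init__(self):
        self.indexes = {}  # index name -> list of documents
        self.aliases = {}  # alias name -> set of index names


def load_and_swap(cluster, alias, run_id, documents):
    """Load documents into a fresh index, then repoint the alias at it."""
    # 1. Create an index whose name is unique to this run (naming is an assumption).
    new_index = "{}_{}".format(alias, run_id)
    cluster.indexes[new_index] = []

    # 2. Load every document into the new index.
    cluster.indexes[new_index].extend(documents)

    # 3. Point the alias at the new index only.
    old_indexes = cluster.aliases.get(alias, set())
    cluster.aliases[alias] = {new_index}

    # 4. Delete the indexes the alias used to point at.
    for name in old_indexes - {new_index}:
        cluster.indexes.pop(name, None)
    return new_index


cluster = FakeCluster()
load_and_swap(cluster, "roster", "run1", [{"id": 1}])
load_and_swap(cluster, "roster", "run2", [{"id": 1}, {"id": 2}])
```

Because the alias is only switched after the new index is fully loaded, readers of the alias never see a partially loaded index.
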
Parameters:
  • alias (Parameter) – Name of the alias in elasticsearch that will point to the complete index when loaded. This value should match the settings of edx-analytics-data-api.
  • batch_size (IntParameter, optional, insignificant) – Number of records to submit to the cluster to be indexed in a single request. A small value here will result in more, smaller, requests and a larger value will result in fewer, bigger requests. Default is 1000.
  • connection_type (Parameter, configurable, insignificant) – If not specified, default to using urllib3 to make HTTP requests to elasticsearch. The other valid value is “aws” which can be used to connect to clusters that are managed by AWS. See AWS elasticsearch service.
  • host (Parameter, configurable) – Hostnames for the elasticsearch cluster nodes. They can be specified in any of the formats accepted by the elasticsearch-py library. This includes complete URLs such as http://foo.com/, or host port pairs such as foo:8000. Note that if you wish to use SSL you should specify a full URL and the “https” scheme. Default is pulled from elasticsearch.host.
  • indexing_tasks (IntParameter, optional, insignificant) – Number of parallel processes to use to submit records for indexing. The stream of records will be divided up evenly among these processes during the indexing procedure. Default is None.
  • input_format (Parameter, optional, insignificant) – The input_format for Hadoop job to use. For example, when running with manifest file, specify “oddjob.ManifestTextInputFormat” for input_format. Default is None.
  • lib_jar (ListParameter, optional, insignificant) – A list of library jars that the Hadoop job can make use of. Default is [].
  • mapreduce_engine (Parameter, configurable, insignificant) – Name of the map reduce job engine to use. Use hadoop (the default) or local.
  • max_attempts (IntParameter, optional, insignificant) – If the elasticsearch cluster rejects a batch of records (usually because it is too busy) the indexing process will retry up to this many times before giving up. It uses an exponential back-off strategy, so a high value here can result in very significant wait times before retrying. Default is 10.
  • n_reduce_tasks (Parameter, optional, insignificant) – Number of reducer tasks to use in upstream tasks. Scale this to your cluster size. Default is 25.
  • number_of_shards (Parameter, optional) – Number of shards to use in the elasticsearch index. Default is None.
  • overwrite (BoolParameter, optional, insignificant) – Whether or not to overwrite existing outputs; set to False by default for now.
  • pool (Parameter, optional, insignificant) – Default is None.
  • remote_log_level (Parameter, configurable, insignificant) – Level of logging for the map reduce tasks. Default is pulled from map-reduce.remote_log_level.
  • throttle (FloatParameter, optional, insignificant) – Wait this many seconds between batches of records submitted to the cluster to be indexed. This can be used to tune the indexing process, allowing the cluster to successfully “keep up” with the loader. Note that often the hadoop cluster can load records much more quickly than the cluster can index them, which eventually causes queues to overflow within the elasticsearch cluster. Default is 0.1.
  • timeout (FloatParameter, configurable, insignificant) – Maximum number of seconds to wait when attempting to make connections to the elasticsearch cluster before assuming the cluster is not responding and giving up with a timeout error. Default is pulled from elasticsearch.timeout.
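
How batch_size, throttle, and max_attempts interact can be sketched as follows. This is an illustration only: the `send` callable, the helper names, and the power-of-two back-off schedule are assumptions rather than the task's actual code, and the `sleep` argument exists so the example can run without really waiting.

```python
import itertools
import time


def batches(records, batch_size):
    """Yield lists of at most batch_size records."""
    iterator = iter(records)
    while True:
        batch = list(itertools.islice(iterator, batch_size))
        if not batch:
            return
        yield batch


def submit_with_retry(send, batch, max_attempts=10, sleep=time.sleep):
    """Call send(batch) until it returns True, backing off exponentially."""
    for attempt in range(1, max_attempts + 1):
        if send(batch):
            return attempt
        if attempt < max_attempts:
            # Assumed schedule: wait 2, 4, 8, ... seconds between attempts.
            sleep(2 ** attempt)
    raise RuntimeError("batch rejected {} times".format(max_attempts))


def index_all(records, send, batch_size=1000, throttle=0.1,
              max_attempts=10, sleep=time.sleep):
    """Submit all records in batches, pausing `throttle` seconds after each batch."""
    submitted = 0
    for batch in batches(records, batch_size):
        submit_with_retry(send, batch, max_attempts, sleep)
        submitted += 1
        sleep(throttle)
    return submitted
```

With an exponential schedule like this one, the final waits dominate the total delay, which is why a large max_attempts can stall the load for a long time.
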

common.mapreduce

Support executing map reduce tasks.

class edx.analytics.tasks.common.mapreduce.MapReduceJobTask(*args, **kwargs)

Execute a map reduce job. Typically using Hadoop, but can execute the job in process as well.

Parameters:
  • input_format (Parameter, optional, insignificant) – The input_format for Hadoop job to use. For example, when running with manifest file, specify “oddjob.ManifestTextInputFormat” for input_format. Default is None.
  • lib_jar (ListParameter, optional, insignificant) – A list of library jars that the Hadoop job can make use of. Default is [].
  • mapreduce_engine (Parameter, configurable, insignificant) – Name of the map reduce job engine to use. Use hadoop (the default) or local.
  • n_reduce_tasks (Parameter, optional, insignificant) – Number of reducer tasks to use in upstream tasks. Scale this to your cluster size. Default is 25.
  • pool (Parameter, optional, insignificant) – Default is None.
  • remote_log_level (Parameter, configurable, insignificant) – Level of logging for the map reduce tasks. Default is pulled from map-reduce.remote_log_level.
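
Parameters marked "configurable" take their default from a configuration file when no value is passed explicitly. The sketch below shows one way to read "Default is pulled from map-reduce.remote_log_level" as a section and option name. It uses the standard library `configparser` purely for illustration; the `resolve` helper and the `INFO` value are assumptions, and the pipeline's own configuration loading may differ.

```python
import configparser

# Illustrative configuration text; the value INFO is an assumption.
EXAMPLE_CONFIG = """
[map-reduce]
remote_log_level = INFO
"""


def resolve(value, config, section, name, fallback=None):
    """Return the explicit value if given, else the configured default."""
    if value is not None:
        return value
    return config.get(section, name, fallback=fallback)


config = configparser.ConfigParser()
config.read_string(EXAMPLE_CONFIG)

# No explicit value: fall back to the [map-reduce] section.
level = resolve(None, config, "map-reduce", "remote_log_level")
```
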
class edx.analytics.tasks.common.mapreduce.MultiOutputMapReduceJobTask(*args, **kwargs)

Produces multiple output files from a map reduce job.

The mapper output tuple key is used to determine the name of the file that reducer results are written to. Different reduce tasks must not write to the same file. Since all values for a given mapper output key are guaranteed to be processed by the same reduce task, we only allow a single file to be output per key for safety. In the future, the reducer output key could be used to determine the output file name, however.
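
The one-file-per-key rule can be sketched in plain Python. This is a simplified, single-process illustration rather than the Hadoop job itself; `group_by_key`, `write_outputs`, and the `path_for_key` callback are hypothetical names.

```python
import os
import tempfile
from collections import defaultdict


def group_by_key(mapper_output):
    """Collect all values for each mapper output key, as the shuffle phase would."""
    grouped = defaultdict(list)
    for key, value in mapper_output:
        grouped[key].append(value)
    return grouped


def write_outputs(mapper_output, output_root, path_for_key):
    """Write exactly one file per key and refuse to reuse a file for two keys."""
    written = {}
    for key, values in group_by_key(mapper_output).items():
        path = os.path.join(output_root, path_for_key(key))
        if path in written.values():
            raise ValueError("two keys map to the same file: " + path)
        with open(path, "w") as output_file:
            for value in values:
                output_file.write(value + "\n")
        written[key] = path
    return written
```

Since each key is handled by a single reducer, no two reducers can ever open the same file when file names are derived from the key alone.
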

Parameters:
  • delete_output_root (BoolParameter, optional, insignificant) – If True, recursively deletes the output_root at task creation. Default is False.
  • input_format (Parameter, optional, insignificant) – The input_format for Hadoop job to use. For example, when running with manifest file, specify “oddjob.ManifestTextInputFormat” for input_format. Default is None.
  • lib_jar (ListParameter, optional, insignificant) – A list of library jars that the Hadoop job can make use of. Default is [].
  • mapreduce_engine (Parameter, configurable, insignificant) – Name of the map reduce job engine to use. Use hadoop (the default) or local.
  • marker (Parameter, configurable, insignificant) – A URL location to a directory where a marker file will be written on task completion. Default is pulled from map-reduce.marker.
  • n_reduce_tasks (Parameter, optional, insignificant) – Number of reducer tasks to use in upstream tasks. Scale this to your cluster size. Default is 25.
  • output_root (Parameter) – A URL location where the split files will be stored.
  • pool (Parameter, optional, insignificant) – Default is None.
  • remote_log_level (Parameter, configurable, insignificant) – Level of logging for the map reduce tasks. Default is pulled from map-reduce.remote_log_level.

common.mysql_load

Support for loading data into a MySQL database.

class edx.analytics.tasks.common.mysql_load.IncrementalMysqlInsertTask(*args, **kwargs)

A MySQL table that is mostly appended to, but occasionally has parts of it overwritten.

When overwriting, the task is responsible for populating some records that need to be easy to identify. There should be a one-to-one relationship between a row and the task that was used to write it. It should be straightforward to construct a where clause that selects all of the rows generated by this task.
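
The append-or-overwrite pattern can be sketched as a delete by where clause followed by an insert. The example uses the standard library `sqlite3` as a stand-in for MySQL so that it runs anywhere; the `daily_counts` table, its columns, and the `write_day` helper are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE daily_counts (day TEXT, course TEXT, n INTEGER)")


def write_day(conn, day, rows, overwrite=False):
    """Insert the rows for one day, first deleting that day's rows if overwriting."""
    if overwrite:
        # The where clause selects exactly the rows this task run owns.
        conn.execute("DELETE FROM daily_counts WHERE day = ?", (day,))
    conn.executemany(
        "INSERT INTO daily_counts (day, course, n) VALUES (?, ?, ?)",
        [(day, course, n) for course, n in rows],
    )
    conn.commit()


write_day(conn, "2024-01-01", [("A", 1), ("B", 2)])
write_day(conn, "2024-01-02", [("A", 5)])
# Re-running the first day replaces only that day's rows.
write_day(conn, "2024-01-01", [("A", 10)], overwrite=True)
```
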

Parameters:
  • credentials (Parameter, configurable) – Path to the external access credentials file. Default is pulled from database-export.credentials.
  • database (Parameter, configurable) – The name of the database to which to write. Default is pulled from database-export.database.
  • insert_chunk_size (IntParameter, optional, insignificant) – The number of rows to insert at a time. Default is 100.
  • overwrite (BoolParameter, optional, insignificant) – Whether or not to overwrite existing outputs; set to False by default for now.
  • use_temp_table_for_overwrite (BoolParameter, optional, insignificant) – Whether to overwrite MySQL data by loading it into a temp table and then renaming that table. Default is False.
class edx.analytics.tasks.common.mysql_load.MysqlInsertTask(*args, **kwargs)

A task for inserting a data set into an RDBMS.

Parameters:
  • credentials (Parameter, configurable) – Path to the external access credentials file. Default is pulled from database-export.credentials.
  • database (Parameter, configurable) – The name of the database to which to write. Default is pulled from database-export.database.
  • insert_chunk_size (IntParameter, optional, insignificant) – The number of rows to insert at a time. Default is 100.
  • overwrite (BoolParameter, optional, insignificant) – Whether or not to overwrite existing outputs; set to False by default for now.
  • use_temp_table_for_overwrite (BoolParameter, optional, insignificant) – Whether to overwrite MySQL data by loading it into a temp table and then renaming that table. Default is False.
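
Chunked inserts combined with a temp-table overwrite can be sketched as follows. The example again uses `sqlite3` in place of MySQL; the `example` and `example_tmp` tables and both helper functions are hypothetical, and the drop-then-rename step is a simplification (MySQL can swap tables in a single RENAME TABLE statement).

```python
import sqlite3


def insert_chunks(conn, table, rows, insert_chunk_size=100):
    """Insert rows in slices of insert_chunk_size; return the number of chunks."""
    chunks = 0
    for start in range(0, len(rows), insert_chunk_size):
        conn.executemany(
            "INSERT INTO {} (id, name) VALUES (?, ?)".format(table),
            rows[start:start + insert_chunk_size],
        )
        chunks += 1
    return chunks


def overwrite_via_temp_table(conn, rows, insert_chunk_size=100):
    """Load rows into a temp table, then rename it over the live table."""
    conn.execute("DROP TABLE IF EXISTS example_tmp")
    conn.execute("CREATE TABLE example_tmp (id INTEGER, name TEXT)")
    chunks = insert_chunks(conn, "example_tmp", rows, insert_chunk_size)
    # Swap: the live table is replaced only after the load has finished.
    conn.execute("DROP TABLE example")
    conn.execute("ALTER TABLE example_tmp RENAME TO example")
    conn.commit()
    return chunks
```

Loading into a separate table keeps the old data readable until the new data is complete.
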

common.sqoop

Gather data using Sqoop table dumps run on RDBMS databases.

class edx.analytics.tasks.common.sqoop.SqoopImportFromMysql(*args, **kwargs)

An abstract task that uses Sqoop to read data out of a MySQL database and writes it to a file in CSV format.

By default, the output format is determined by the --mysql-delimiters option, which applies the defaults used by the mysqldump tool:

  • fields delimited by comma
  • lines delimited by newline (\n)
  • delimiters escaped by backslash
  • delimiters optionally enclosed by single quotes (')
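
Output written with these delimiters can be read back with Python's `csv` module by setting the matching dialect options. The sketch below is a minimal illustration; the sample text and the `parse_mysql_delimited` helper are invented for the example.

```python
import csv
import io

# Two illustrative lines: the first has a quoted field containing
# an escaped single quote and an embedded comma.
SAMPLE = "1,'O\\'Brien, Pat',42\n2,plain,7\n"


def parse_mysql_delimited(text):
    """Parse comma-delimited text that uses single-quote enclosure and backslash escapes."""
    reader = csv.reader(
        io.StringIO(text),
        delimiter=",",
        quotechar="'",
        escapechar="\\",
    )
    return list(reader)


rows = parse_mysql_delimited(SAMPLE)
```
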
Parameters:
  • additional_metadata (DictParameter, optional, insignificant) – Override this to provide the metadata file with additional information about the Sqoop output. Default is None.
  • columns (ListParameter, optional) – A list of column names to be included. Default is to include all columns.
  • credentials (Parameter, configurable) – Path to the external access credentials file. Default is pulled from database-import.credentials.
  • database (Parameter, configurable) – Default is pulled from database-import.database.
  • delimiter_replacement (Parameter, optional) – Defines a character to use as replacement for delimiters that appear within data values, for use with Hive. Not specified by default.
  • destination (Parameter, configurable) – The directory to write the output files to. Default is pulled from database-import.destination.
  • direct (BoolParameter, optional, insignificant) – Use mysqldump’s “direct” mode. Requires that no set of columns be selected. Default is True.
  • enclosed_by (Parameter, optional) – Defines the character to use on output to enclose field values. Default is None.
  • escaped_by (Parameter, optional) – Defines the character to use on output to escape delimiter values when they appear in field values. Default is None.
  • fields_terminated_by (Parameter, optional) – Defines the field separator to use on output. Default is None.
  • mysql_delimiters (BoolParameter, optional) – Use standard mysql delimiters (on by default).
  • null_string (Parameter, optional) – String to use to represent NULL values in output data. Default is None.
  • num_mappers (Parameter, optional, insignificant) – The number of map tasks to ask Sqoop to use. Default is None.
  • optionally_enclosed_by (Parameter, optional) – Defines the character to use on output to enclose field values when they may contain a delimiter. Default is None.
  • overwrite (BoolParameter, optional, insignificant) – Whether or not to overwrite existing outputs; set to False by default for now.
  • pool (Parameter, optional, insignificant) – Default is None.
  • table_name (Parameter) – The name of the table to import.
  • verbose (BoolParameter, optional, insignificant) – Print more information while working. Default is False.
  • where (Parameter, optional) – A “where” clause to be passed to Sqoop. Note that no spaces should be embedded and special characters should be escaped. For example: --where "id<50". Default is None.
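
How these parameters might translate into a Sqoop command line can be sketched as below. This is an assumption-laden illustration, not the task's code: `build_sqoop_args` and the `host` default are hypothetical, credentials handling is omitted, and only a subset of the parameters is covered. The flags themselves are standard `sqoop import` options.

```python
def build_sqoop_args(database, table_name, destination, host="db.example.com",
                     columns=None, where=None, num_mappers=None,
                     direct=True, mysql_delimiters=True, verbose=False):
    """Build an argument list for `sqoop import` from task-style parameters."""
    args = [
        "sqoop", "import",
        "--connect", "jdbc:mysql://{}/{}".format(host, database),
        "--table", table_name,
        "--target-dir", destination,
    ]
    if columns:
        args += ["--columns", ",".join(columns)]
    if where:
        # Passed as a single argument, e.g. "id<50" with no embedded spaces.
        args += ["--where", where]
    if num_mappers is not None:
        args += ["--num-mappers", str(num_mappers)]
    if direct:
        # Direct mode requires that no column subset be selected.
        if columns:
            raise ValueError("direct mode cannot be combined with columns")
        args.append("--direct")
    if mysql_delimiters:
        args.append("--mysql-delimiters")
    if verbose:
        args.append("--verbose")
    return args


args = build_sqoop_args("lms", "auth_user", "s3://bucket/out/",
                        where="id<50", num_mappers=4)
```
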
class edx.analytics.tasks.common.sqoop.SqoopImportTask(*args, **kwargs)

An abstract task that uses Sqoop to read data out of a database and writes it to a file in CSV format.

Inherited parameters:
overwrite: Overwrite any existing imports. Default is false.
Parameters:
  • additional_metadata (DictParameter, optional, insignificant) – Override this to provide the metadata file with additional information about the Sqoop output. Default is None.
  • columns (ListParameter, optional) – A list of column names to be included. Default is to include all columns.
  • credentials (Parameter, configurable) – Path to the external access credentials file. Default is pulled from database-import.credentials.
  • database (Parameter, configurable) – Default is pulled from database-import.database.
  • delimiter_replacement (Parameter, optional) – Defines a character to use as replacement for delimiters that appear within data values, for use with Hive. Not specified by default.
  • destination (Parameter, configurable) – The directory to write the output files to. Default is pulled from database-import.destination.
  • enclosed_by (Parameter, optional) – Defines the character to use on output to enclose field values. Default is None.
  • escaped_by (Parameter, optional) – Defines the character to use on output to escape delimiter values when they appear in field values. Default is None.
  • fields_terminated_by (Parameter, optional) – Defines the field separator to use on output. Default is None.
  • null_string (Parameter, optional) – String to use to represent NULL values in output data. Default is None.
  • num_mappers (Parameter, optional, insignificant) – The number of map tasks to ask Sqoop to use. Default is None.
  • optionally_enclosed_by (Parameter, optional) – Defines the character to use on output to enclose field values when they may contain a delimiter. Default is None.
  • overwrite (BoolParameter, optional, insignificant) – Whether or not to overwrite existing outputs; set to False by default for now.
  • pool (Parameter, optional, insignificant) – Default is None.
  • table_name (Parameter) – The name of the table to import.
  • verbose (BoolParameter, optional, insignificant) – Print more information while working. Default is False.
  • where (Parameter, optional) – A “where” clause to be passed to Sqoop. Note that no spaces should be embedded and special characters should be escaped. For example: --where "id<50". Default is None.

enterprise.enterprise_database_imports

Import data from external RDBMS databases specific to enterprise into Hive.

class edx.analytics.tasks.enterprise.enterprise_database_imports.ImportBenefitTask(*args, **kwargs)

Ecommerce: Imports offer benefit information from an ecommerce table to a destination directory and a HIVE metastore.

Parameters:
  • credentials (Parameter, configurable) – Path to the external access credentials file. Default is pulled from database-import.credentials.
  • database (Parameter, configurable) – Default is pulled from database-import.database.
  • destination (Parameter, configurable) – The directory to write the output files to. Default is pulled from database-import.destination.
  • import_date (DateParameter, optional) – Date to assign to Hive partition. Default is today’s date, UTC.
  • num_mappers (Parameter, optional, insignificant) – The number of map tasks to ask Sqoop to use. Default is None.
  • overwrite (BoolParameter, optional, insignificant) – Whether or not to overwrite existing outputs; set to False by default for now.
  • pool (Parameter, optional, insignificant) – Default is None.
  • verbose (BoolParameter, optional, insignificant) – Print more information while working. Default is False.
  • where (Parameter, optional) – A “where” clause to be passed to Sqoop. Note that no spaces should be embedded and special characters should be escaped. For example: --where "id<50". Default is None.
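
The import_date default ("today’s date, UTC") and its use as a Hive partition value can be sketched as follows. The `partition_location` helper, the `dt=` partition key, and the directory layout are assumptions made for illustration; the example table name is also hypothetical.

```python
import datetime


def partition_location(destination, table_name, import_date=None):
    """Return the directory for one import, partitioned by date."""
    if import_date is None:
        # Default to the current date in UTC, not the local time zone.
        import_date = datetime.datetime.now(datetime.timezone.utc).date()
    return "{}/{}/dt={}/".format(
        destination.rstrip("/"), table_name, import_date.isoformat()
    )


location = partition_location(
    "s3://bucket/warehouse/", "example_table", datetime.date(2024, 1, 2)
)
```

Pinning import_date explicitly makes a re-run target the same partition, whereas the default moves to a new partition each UTC day.
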
class edx.analytics.tasks.enterprise.enterprise_database_imports.ImportConditionalOfferTask(*args, **kwargs)

Ecommerce: Imports conditional offer information from an ecommerce table to a destination directory and a HIVE metastore.

Parameters:
  • credentials (Parameter, configurable) – Path to the external access credentials file. Default is pulled from database-import.credentials.
  • database (Parameter, configurable) – Default is pulled from database-import.database.
  • destination (Parameter, configurable) – The directory to write the output files to. Default is pulled from database-import.destination.
  • import_date (DateParameter, optional) – Date to assign to Hive partition. Default is today’s date, UTC.
  • num_mappers (Parameter, optional, insignificant) – The number of map tasks to ask Sqoop to use. Default is None.
  • overwrite (BoolParameter, optional, insignificant) – Whether or not to overwrite existing outputs; set to False by default for now.
  • pool (Parameter, optional, insignificant) – Default is None.
  • verbose (BoolParameter, optional, insignificant) – Print more information while working. Default is False.
  • where (Parameter, optional) – A “where” clause to be passed to Sqoop. Note that no spaces should be embedded and special characters should be escaped. For example: --where "id<50". Default is None.
class edx.analytics.tasks.enterprise.enterprise_database_imports.ImportDataSharingConsentTask(*args, **kwargs)

Imports the consent_datasharingconsent table to S3/Hive.

Parameters:
  • credentials (Parameter, configurable) – Path to the external access credentials file. Default is pulled from database-import.credentials.
  • database (Parameter, configurable) – Default is pulled from database-import.database.
  • destination (Parameter, configurable) – The directory to write the output files to. Default is pulled from database-import.destination.
  • import_date (DateParameter, optional) – Date to assign to Hive partition. Default is today’s date, UTC.
  • num_mappers (Parameter, optional, insignificant) – The number of map tasks to ask Sqoop to use. Default is None.
  • overwrite (BoolParameter, optional, insignificant) – Whether or not to overwrite existing outputs; set to False by default for now.
  • pool (Parameter, optional, insignificant) – Default is None.
  • verbose (BoolParameter, optional, insignificant) – Print more information while working. Default is False.
  • where (Parameter, optional) – A “where” clause to be passed to Sqoop. Note that no spaces should be embedded and special characters should be escaped. For example: --where "id<50". Default is None.
class edx.analytics.tasks.enterprise.enterprise_database_imports.ImportEnterpriseCourseEnrollmentUserTask(*args, **kwargs)

Imports the enterprise_enterprisecourseenrollment table to S3/Hive.

Parameters:
  • credentials (Parameter, configurable) – Path to the external access credentials file. Default is pulled from database-import.credentials.
  • database (Parameter, configurable) – Default is pulled from database-import.database.
  • destination (Parameter, configurable) – The directory to write the output files to. Default is pulled from database-import.destination.
  • import_date (DateParameter, optional) – Date to assign to Hive partition. Default is today’s date, UTC.
  • num_mappers (Parameter, optional, insignificant) – The number of map tasks to ask Sqoop to use. Default is None.
  • overwrite (BoolParameter, optional, insignificant) – Whether or not to overwrite existing outputs; set to False by default for now.
  • pool (Parameter, optional, insignificant) – Default is None.
  • verbose (BoolParameter, optional, insignificant) – Print more information while working. Default is False.
  • where (Parameter, optional) – A “where” clause to be passed to Sqoop. Note that no spaces should be embedded and special characters should be escaped. For example: --where "id<50". Default is None.
class edx.analytics.tasks.enterprise.enterprise_database_imports.ImportEnterpriseCustomerTask(*args, **kwargs)

Imports the enterprise_enterprisecustomer table to S3/Hive.

Parameters:
  • credentials (Parameter, configurable) – Path to the external access credentials file. Default is pulled from database-import.credentials.
  • database (Parameter, configurable) – Default is pulled from database-import.database.
  • destination (Parameter, configurable) – The directory to write the output files to. Default is pulled from database-import.destination.
  • import_date (DateParameter, optional) – Date to assign to Hive partition. Default is today’s date, UTC.
  • num_mappers (Parameter, optional, insignificant) – The number of map tasks to ask Sqoop to use. Default is None.
  • overwrite (BoolParameter, optional, insignificant) – Whether or not to overwrite existing outputs; set to False by default for now.
  • pool (Parameter, optional, insignificant) – Default is None.
  • verbose (BoolParameter, optional, insignificant) – Print more information while working. Default is False.
  • where (Parameter, optional) – A “where” clause to be passed to Sqoop. Note that no spaces should be embedded and special characters should be escaped. For example: --where "id<50". Default is None.
class edx.analytics.tasks.enterprise.enterprise_database_imports.ImportEnterpriseCustomerUserTask(*args, **kwargs)

Imports the enterprise_enterprisecustomeruser table to S3/Hive.

Parameters:
  • credentials (Parameter, configurable) – Path to the external access credentials file. Default is pulled from database-import.credentials.
  • database (Parameter, configurable) – Default is pulled from database-import.database.
  • destination (Parameter, configurable) – The directory to write the output files to. Default is pulled from database-import.destination.
  • import_date (DateParameter, optional) – Date to assign to Hive partition. Default is today’s date, UTC.
  • num_mappers (Parameter, optional, insignificant) – The number of map tasks to ask Sqoop to use. Default is None.
  • overwrite (BoolParameter, optional, insignificant) – Whether or not to overwrite existing outputs; set to False by default for now.
  • pool (Parameter, optional, insignificant) – Default is None.
  • verbose (BoolParameter, optional, insignificant) – Print more information while working. Default is False.
  • where (Parameter, optional) – A “where” clause to be passed to Sqoop. Note that no spaces should be embedded and special characters should be escaped. For example: --where "id<50". Default is None.
class edx.analytics.tasks.enterprise.enterprise_database_imports.ImportStockRecordTask(*args, **kwargs)

Ecommerce: Imports the partner_stockrecord table from the ecommerce database to a destination directory and a HIVE metastore.

A voucher is a discount coupon that can be applied to ecommerce purchases.

Parameters:
  • credentials (Parameter, configurable) – Path to the external access credentials file. Default is pulled from database-import.credentials.
  • database (Parameter, configurable) – Default is pulled from database-import.database.
  • destination (Parameter, configurable) – The directory to write the output files to. Default is pulled from database-import.destination.
  • import_date (DateParameter, optional) – Date to assign to Hive partition. Default is today’s date, UTC.
  • num_mappers (Parameter, optional, insignificant) – The number of map tasks to ask Sqoop to use. Default is None.
  • overwrite (BoolParameter, optional, insignificant) – Whether or not to overwrite existing outputs; set to False by default for now.
  • pool (Parameter, optional, insignificant) – Default is None.
  • verbose (BoolParameter, optional, insignificant) – Print more information while working. Default is False.
  • where (Parameter, optional) – A “where” clause to be passed to Sqoop. Note that no spaces should be embedded and special characters should be escaped. For example: --where "id<50". Default is None.
class edx.analytics.tasks.enterprise.enterprise_database_imports.ImportUserSocialAuthTask(*args, **kwargs)

Imports the social_auth_usersocialauth table to S3/Hive.

Parameters:
  • credentials (Parameter, configurable) – Path to the external access credentials file. Default is pulled from database-import.credentials.
  • database (Parameter, configurable) – Default is pulled from database-import.database.
  • destination (Parameter, configurable) – The directory to write the output files to. Default is pulled from database-import.destination.
  • import_date (DateParameter, optional) – Date to assign to Hive partition. Default is today’s date, UTC.
  • num_mappers (Parameter, optional, insignificant) – The number of map tasks to ask Sqoop to use. Default is None.
  • overwrite (BoolParameter, optional, insignificant) – Whether or not to overwrite existing outputs; set to False by default for now.
  • pool (Parameter, optional, insignificant) – Default is None.
  • verbose (BoolParameter, optional, insignificant) – Print more information while working. Default is False.
  • where (Parameter, optional) – A “where” clause to be passed to Sqoop. Note that no spaces should be embedded and special characters should be escaped. For example: --where "id<50". Default is None.
class edx.analytics.tasks.enterprise.enterprise_database_imports.ImportVoucherTask(*args, **kwargs)

Ecommerce: Imports the voucher_voucher table from the ecommerce database to a destination directory and a HIVE metastore.

A voucher is a discount coupon that can be applied to ecommerce purchases.

Parameters:
  • credentials (Parameter, configurable) – Path to the external access credentials file. Default is pulled from database-import.credentials.
  • database (Parameter, configurable) – Default is pulled from database-import.database.
  • destination (Parameter, configurable) – The directory to write the output files to. Default is pulled from database-import.destination.
  • import_date (DateParameter, optional) – Date to assign to Hive partition. Default is today’s date, UTC.
  • num_mappers (Parameter, optional, insignificant) – The number of map tasks to ask Sqoop to use. Default is None.
  • overwrite (BoolParameter, optional, insignificant) – Whether or not to overwrite existing outputs; set to False by default for now.
  • pool (Parameter, optional, insignificant) – Default is None.
  • verbose (BoolParameter, optional, insignificant) – Print more information while working. Default is False.
  • where (Parameter, optional) – A “where” clause to be passed to Sqoop. Note that no spaces should be embedded and special characters should be escaped. For example: --where "id<50". Default is None.

insights.calendar_task

A canonical calendar that can be joined with other tables to provide information about dates.

class edx.analytics.tasks.insights.calendar_task.CalendarTableTask(*args, **kwargs)

Ensure a hive table exists for the calendar so that we can perform joins.

Parameters:
  • interval (DateIntervalParameter, configurable) – Default is pulled from calendar.interval.
  • overwrite (BoolParameter, optional, insignificant) – Whether or not to overwrite existing outputs; set to False by default for now.
  • pool (Parameter, optional, insignificant) – Default is None.
  • warehouse_path (Parameter, configurable) – A URL location of the data warehouse. Default is pulled from hive.warehouse_path.
class edx.analytics.tasks.insights.calendar_task.CalendarTask(*args, **kwargs)

Generate a canonical calendar.

This table provides information about every day in every year that is being analyzed. It captures many complex details associated with calendars and standardizes references to concepts like “weeks” since they can be defined in different ways by various systems.

It is also intended to contain business-specific metadata about dates in the future, such as fiscal year boundaries, fiscal quarter boundaries and even holidays or other days of special interest for analysis purposes.

Parameters:
  • interval (DateIntervalParameter, configurable) – Default is pulled from calendar.interval.
  • output_root (Parameter) – URL to store the calendar data.
  • overwrite (BoolParameter, optional, insignificant) – Whether or not to overwrite existing outputs; set to False by default for now.
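
A canonical calendar of this kind can be sketched with the standard library. The column set below (date, year, month, day, ISO week number, ISO week start, ISO week end, ISO weekday) and the half-open interval are assumptions chosen for illustration; the task's actual output columns may differ.

```python
import datetime


def calendar_rows(start, end):
    """Yield one row per day in [start, end), using ISO 8601 week definitions."""
    one_day = datetime.timedelta(days=1)
    day = start
    while day < end:
        _iso_year, iso_week, iso_weekday = day.isocalendar()
        # ISO weeks start on Monday (weekday 1).
        week_start = day - datetime.timedelta(days=iso_weekday - 1)
        week_end = week_start + datetime.timedelta(days=7)  # exclusive
        yield (
            day.isoformat(), day.year, day.month, day.day,
            iso_week, week_start.isoformat(), week_end.isoformat(), iso_weekday,
        )
        day += one_day


rows = list(calendar_rows(datetime.date(2024, 1, 1), datetime.date(2024, 1, 8)))
```

Joining other tables against rows like these gives every query the same definition of a "week", rather than each one recomputing it.
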

insights.database_imports

Import data from external RDBMS databases into Hive.

class edx.analytics.tasks.insights.database_imports.ImportAllDatabaseTablesTask(*args, **kwargs)

Imports a set of database tables from an external LMS RDBMS.

Parameters:
  • credentials (Parameter, configurable) – Path to the external access credentials file. Default is pulled from database-import.credentials.
  • database (Parameter, configurable) – Default is pulled from database-import.database.
  • destination (Parameter, configurable) – The directory to write the output files to. Default is pulled from database-import.destination.
  • import_date (DateParameter, optional) – Date to assign to Hive partition. Default is today’s date, UTC.
  • num_mappers (Parameter, optional, insignificant) – The number of map tasks to ask Sqoop to use. Default is None.
  • overwrite (BoolParameter, optional, insignificant) – Whether or not to overwrite existing outputs; set to False by default for now.
  • verbose (BoolParameter, optional, insignificant) – Print more information while working. Default is False.
  • where (Parameter, optional) – A “where” clause to be passed to Sqoop. Note that no spaces should be embedded and special characters should be escaped. For example: --where "id<50". Default is None.
class edx.analytics.tasks.insights.database_imports.ImportAuthUserProfileTask(*args, **kwargs)

Imports user demographic information from an external LMS DB to both a destination directory and a HIVE metastore.

Parameters:
  • credentials (Parameter, configurable) – Path to the external access credentials file. Default is pulled from database-import.credentials.
  • database (Parameter, configurable) – Default is pulled from database-import.database.
  • destination (Parameter, configurable) – The directory to write the output files to. Default is pulled from database-import.destination.
  • import_date (DateParameter, optional) – Date to assign to Hive partition. Default is today’s date, UTC.
  • num_mappers (Parameter, optional, insignificant) – The number of map tasks to ask Sqoop to use. Default is None.
  • overwrite (BoolParameter, optional, insignificant) – Whether or not to overwrite existing outputs; set to False by default for now.
  • pool (Parameter, optional, insignificant) – Default is None.
  • verbose (BoolParameter, optional, insignificant) – Print more information while working. Default is False.
  • where (Parameter, optional) – A “where” clause to be passed to Sqoop. Note that no spaces should be embedded and special characters should be escaped. For example: --where "id<50". Default is None.
class edx.analytics.tasks.insights.database_imports.ImportAuthUserTask(*args, **kwargs)

Imports user information from an external LMS DB to a destination directory.

Parameters:
  • credentials (Parameter, configurable) – Path to the external access credentials file. Default is pulled from database-import.credentials.
  • database (Parameter, configurable) – Default is pulled from database-import.database.
  • destination (Parameter, configurable) – The directory to write the output files to. Default is pulled from database-import.destination.
  • import_date (DateParameter, optional) – Date to assign to Hive partition. Default is today’s date, UTC.
  • num_mappers (Parameter, optional, insignificant) – The number of map tasks to ask Sqoop to use. Default is None.
  • overwrite (BoolParameter, optional, insignificant) – Whether or not to overwrite existing outputs; set to False by default for now.
  • pool (Parameter, optional, insignificant) – Default is None.
  • verbose (BoolParameter, optional, insignificant) – Print more information while working. Default is False.
  • where (Parameter, optional) – A “where” clause to be passed to Sqoop. Note that no spaces should be embedded and special characters should be escaped. For example: --where “id<50”. Default is None.
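The import_date default is described as today’s date in UTC. A minimal sketch of computing such a value with the standard library follows; the function name default_import_date is hypothetical and the pipeline’s own implementation may differ.

```python
import datetime


def default_import_date(now=None):
    """Return the calendar date in UTC for an aware datetime (or for the current time)."""
    if now is None:
        now = datetime.datetime.now(datetime.timezone.utc)
    return now.astimezone(datetime.timezone.utc).date()


# 23:30 on 1 March in UTC-5 is already 2 March in UTC.
local = datetime.datetime(
    2024, 3, 1, 23, 30, tzinfo=datetime.timezone(datetime.timedelta(hours=-5))
)
print(default_import_date(local))  # 2024-03-02
```

Using UTC rather than local time means a run started late in the evening in a western timezone may be assigned the next calendar day.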
class edx.analytics.tasks.insights.database_imports.ImportCouponVoucherIndirectionState(*args, **kwargs)

Ecommerce: Current: Imports the voucher_couponvouchers table from the ecommerce database to a destination directory and a HIVE metastore.

This table is just an extra layer of indirection in the source schema design and is required to translate a ‘couponvouchers_id’ into a coupon id. Coupons are represented as products in the product table, which is imported separately. A coupon can have many voucher codes associated with it.

Parameters:
  • credentials (Parameter, configurable) – Path to the external access credentials file. Default is pulled from database-import.credentials.
  • database (Parameter, configurable) – Default is pulled from database-import.database.
  • destination (Parameter, configurable) – The directory to write the output files to. Default is pulled from database-import.destination.
  • import_date (DateParameter, optional) – Date to assign to Hive partition. Default is today’s date, UTC.
  • num_mappers (Parameter, optional, insignificant) – The number of map tasks to ask Sqoop to use. Default is None.
  • overwrite (BoolParameter, optional, insignificant) – Whether or not to overwrite existing outputs; set to False by default for now.
  • pool (Parameter, optional, insignificant) – Default is None.
  • verbose (BoolParameter, optional, insignificant) – Print more information while working. Default is False.
  • where (Parameter, optional) – A “where” clause to be passed to Sqoop. Note that no spaces should be embedded and special characters should be escaped. For example: --where “id<50”. Default is None.
class edx.analytics.tasks.insights.database_imports.ImportCouponVoucherState(*args, **kwargs)

Ecommerce: Current: Imports the voucher_couponvouchers_vouchers table from the ecommerce database to a destination directory and a HIVE metastore.

A coupon can have many voucher codes associated with it. This table associates voucher IDs with ‘couponvouchers_id’s, which are stored in the voucher_couponvouchers table and have a 1:1 relationship to coupon IDs.

Parameters:
  • credentials (Parameter, configurable) – Path to the external access credentials file. Default is pulled from database-import.credentials.
  • database (Parameter, configurable) – Default is pulled from database-import.database.
  • destination (Parameter, configurable) – The directory to write the output files to. Default is pulled from database-import.destination.
  • import_date (DateParameter, optional) – Date to assign to Hive partition. Default is today’s date, UTC.
  • num_mappers (Parameter, optional, insignificant) – The number of map tasks to ask Sqoop to use. Default is None.
  • overwrite (BoolParameter, optional, insignificant) – Whether or not to overwrite existing outputs; set to False by default for now.
  • pool (Parameter, optional, insignificant) – Default is None.
  • verbose (BoolParameter, optional, insignificant) – Print more information while working. Default is False.
  • where (Parameter, optional) – A “where” clause to be passed to Sqoop. Note that no spaces should be embedded and special characters should be escaped. For example: --where “id<50”. Default is None.
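The voucher_couponvouchers_vouchers and voucher_couponvouchers tables together form a two-step lookup from a voucher to its coupon. The sketch below uses made-up rows and assumed key names; the real table schemas are not shown in this document.

```python
# Rows standing in for voucher_couponvouchers_vouchers:
# each voucher ID is associated with a couponvouchers_id.
voucher_to_couponvouchers = {101: 1, 102: 1, 103: 2}

# Rows standing in for voucher_couponvouchers:
# each couponvouchers_id maps 1:1 to a coupon ID.
couponvouchers_to_coupon = {1: 9001, 2: 9002}


def coupon_for_voucher(voucher_id):
    """Resolve a voucher ID to its coupon ID through the indirection table."""
    couponvouchers_id = voucher_to_couponvouchers[voucher_id]
    return couponvouchers_to_coupon[couponvouchers_id]


def vouchers_for_coupon(coupon_id):
    """List every voucher ID attached to a coupon (a coupon can have many)."""
    return sorted(
        voucher_id
        for voucher_id, cv_id in voucher_to_couponvouchers.items()
        if couponvouchers_to_coupon[cv_id] == coupon_id
    )


print(coupon_for_voucher(103))    # 9002
print(vouchers_for_coupon(9001))  # [101, 102]
```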
class edx.analytics.tasks.insights.database_imports.ImportCourseEntitlementTask(*args, **kwargs)

Imports the table containing learners’ course entitlements.

Parameters:
  • credentials (Parameter, configurable) – Path to the external access credentials file. Default is pulled from database-import.credentials.
  • database (Parameter, configurable) – Default is pulled from database-import.database.
  • destination (Parameter, configurable) – The directory to write the output files to. Default is pulled from database-import.destination.
  • import_date (DateParameter, optional) – Date to assign to Hive partition. Default is today’s date, UTC.
  • num_mappers (Parameter, optional, insignificant) – The number of map tasks to ask Sqoop to use. Default is None.
  • overwrite (BoolParameter, optional, insignificant) – Whether or not to overwrite existing outputs; set to False by default for now.
  • pool (Parameter, optional, insignificant) – Default is None.
  • verbose (BoolParameter, optional, insignificant) – Print more information while working. Default is False.
  • where (Parameter, optional) – A “where” clause to be passed to Sqoop. Note that no spaces should be embedded and special characters should be escaped. For example: --where “id<50”. Default is None.
class edx.analytics.tasks.insights.database_imports.ImportCourseModeTask(*args, **kwargs)

Course Information: Imports course_modes table to both a destination directory and a HIVE metastore.

Parameters:
  • credentials (Parameter, configurable) – Path to the external access credentials file. Default is pulled from database-import.credentials.
  • database (Parameter, configurable) – Default is pulled from database-import.database.
  • destination (Parameter, configurable) – The directory to write the output files to. Default is pulled from database-import.destination.
  • import_date (DateParameter, optional) – Date to assign to Hive partition. Default is today’s date, UTC.
  • num_mappers (Parameter, optional, insignificant) – The number of map tasks to ask Sqoop to use. Default is None.
  • overwrite (BoolParameter, optional, insignificant) – Whether or not to overwrite existing outputs; set to False by default for now.
  • pool (Parameter, optional, insignificant) – Default is None.
  • verbose (BoolParameter, optional, insignificant) – Print more information while working. Default is False.
  • where (Parameter, optional) – A “where” clause to be passed to Sqoop. Note that no spaces should be embedded and special characters should be escaped. For example: --where “id<50”. Default is None.
class edx.analytics.tasks.insights.database_imports.ImportCourseUserGroupTask(*args, **kwargs)

Imports course cohort information from an external LMS DB to both a destination directory and a HIVE metastore.

Parameters:
  • credentials (Parameter, configurable) – Path to the external access credentials file. Default is pulled from database-import.credentials.
  • database (Parameter, configurable) – Default is pulled from database-import.database.
  • destination (Parameter, configurable) – The directory to write the output files to. Default is pulled from database-import.destination.
  • import_date (DateParameter, optional) – Date to assign to Hive partition. Default is today’s date, UTC.
  • num_mappers (Parameter, optional, insignificant) – The number of map tasks to ask Sqoop to use. Default is None.
  • overwrite (BoolParameter, optional, insignificant) – Whether or not to overwrite existing outputs; set to False by default for now.
  • pool (Parameter, optional, insignificant) – Default is None.
  • verbose (BoolParameter, optional, insignificant) – Print more information while working. Default is False.
  • where (Parameter, optional) – A “where” clause to be passed to Sqoop. Note that no spaces should be embedded and special characters should be escaped. For example: --where “id<50”. Default is None.
class edx.analytics.tasks.insights.database_imports.ImportCourseUserGroupUsersTask(*args, **kwargs)

Imports user cohort information from an external LMS DB to both a destination directory and a HIVE metastore.

Parameters:
  • credentials (Parameter, configurable) – Path to the external access credentials file. Default is pulled from database-import.credentials.
  • database (Parameter, configurable) – Default is pulled from database-import.database.
  • destination (Parameter, configurable) – The directory to write the output files to. Default is pulled from database-import.destination.
  • import_date (DateParameter, optional) – Date to assign to Hive partition. Default is today’s date, UTC.
  • num_mappers (Parameter, optional, insignificant) – The number of map tasks to ask Sqoop to use. Default is None.
  • overwrite (BoolParameter, optional, insignificant) – Whether or not to overwrite existing outputs; set to False by default for now.
  • pool (Parameter, optional, insignificant) – Default is None.
  • verbose (BoolParameter, optional, insignificant) – Print more information while working. Default is False.
  • where (Parameter, optional) – A “where” clause to be passed to Sqoop. Note that no spaces should be embedded and special characters should be escaped. For example: --where “id<50”. Default is None.
class edx.analytics.tasks.insights.database_imports.ImportCurrentOrderDiscountState(*args, **kwargs)

Ecommerce: Current: Imports current order discount records from an ecommerce table to a destination directory and a HIVE metastore.

Parameters:
  • credentials (Parameter, configurable) – Path to the external access credentials file. Default is pulled from database-import.credentials.
  • database (Parameter, configurable) – Default is pulled from database-import.database.
  • destination (Parameter, configurable) – The directory to write the output files to. Default is pulled from database-import.destination.
  • import_date (DateParameter, optional) – Date to assign to Hive partition. Default is today’s date, UTC.
  • num_mappers (Parameter, optional, insignificant) – The number of map tasks to ask Sqoop to use. Default is None.
  • overwrite (BoolParameter, optional, insignificant) – Whether or not to overwrite existing outputs; set to False by default for now.
  • pool (Parameter, optional, insignificant) – Default is None.
  • verbose (BoolParameter, optional, insignificant) – Print more information while working. Default is False.
  • where (Parameter, optional) – A “where” clause to be passed to Sqoop. Note that no spaces should be embedded and special characters should be escaped. For example: --where “id<50”. Default is None.
class edx.analytics.tasks.insights.database_imports.ImportCurrentOrderLineState(*args, **kwargs)

Ecommerce: Current: Imports current order line items from an ecommerce table to a destination directory and a HIVE metastore.

Parameters:
  • credentials (Parameter, configurable) – Path to the external access credentials file. Default is pulled from database-import.credentials.
  • database (Parameter, configurable) – Default is pulled from database-import.database.
  • destination (Parameter, configurable) – The directory to write the output files to. Default is pulled from database-import.destination.
  • import_date (DateParameter, optional) – Date to assign to Hive partition. Default is today’s date, UTC.
  • num_mappers (Parameter, optional, insignificant) – The number of map tasks to ask Sqoop to use. Default is None.
  • overwrite (BoolParameter, optional, insignificant) – Whether or not to overwrite existing outputs; set to False by default for now.
  • pool (Parameter, optional, insignificant) – Default is None.
  • verbose (BoolParameter, optional, insignificant) – Print more information while working. Default is False.
  • where (Parameter, optional) – A “where” clause to be passed to Sqoop. Note that no spaces should be embedded and special characters should be escaped. For example: --where “id<50”. Default is None.
class edx.analytics.tasks.insights.database_imports.ImportCurrentOrderState(*args, **kwargs)

Ecommerce: Current: Imports current orders from an ecommerce table to both a destination directory and a HIVE metastore.

Parameters:
  • credentials (Parameter, configurable) – Path to the external access credentials file. Default is pulled from database-import.credentials.
  • database (Parameter, configurable) – Default is pulled from database-import.database.
  • destination (Parameter, configurable) – The directory to write the output files to. Default is pulled from database-import.destination.
  • import_date (DateParameter, optional) – Date to assign to Hive partition. Default is today’s date, UTC.
  • num_mappers (Parameter, optional, insignificant) – The number of map tasks to ask Sqoop to use. Default is None.
  • overwrite (BoolParameter, optional, insignificant) – Whether or not to overwrite existing outputs; set to False by default for now.
  • pool (Parameter, optional, insignificant) – Default is None.
  • verbose (BoolParameter, optional, insignificant) – Print more information while working. Default is False.
  • where (Parameter, optional) – A “where” clause to be passed to Sqoop. Note that no spaces should be embedded and special characters should be escaped. For example: --where “id<50”. Default is None.
class edx.analytics.tasks.insights.database_imports.ImportCurrentRefundRefundLineState(*args, **kwargs)

Ecommerce: Current: Imports current refund line items from an ecommerce table to both a destination directory and a HIVE metastore.

Parameters:
  • credentials (Parameter, configurable) – Path to the external access credentials file. Default is pulled from database-import.credentials.
  • database (Parameter, configurable) – Default is pulled from database-import.database.
  • destination (Parameter, configurable) – The directory to write the output files to. Default is pulled from database-import.destination.
  • import_date (DateParameter, optional) – Date to assign to Hive partition. Default is today’s date, UTC.
  • num_mappers (Parameter, optional, insignificant) – The number of map tasks to ask Sqoop to use. Default is None.
  • overwrite (BoolParameter, optional, insignificant) – Whether or not to overwrite existing outputs; set to False by default for now.
  • pool (Parameter, optional, insignificant) – Default is None.
  • verbose (BoolParameter, optional, insignificant) – Print more information while working. Default is False.
  • where (Parameter, optional) – A “where” clause to be passed to Sqoop. Note that no spaces should be embedded and special characters should be escaped. For example: --where “id<50”. Default is None.
class edx.analytics.tasks.insights.database_imports.ImportEcommercePartner(*args, **kwargs)

Ecommerce: Current: Imports Partner information from an ecommerce table to a destination directory and a HIVE metastore.

Parameters:
  • credentials (Parameter, configurable) – Path to the external access credentials file. Default is pulled from database-import.credentials.
  • database (Parameter, configurable) – Default is pulled from database-import.database.
  • destination (Parameter, configurable) – The directory to write the output files to. Default is pulled from database-import.destination.
  • import_date (DateParameter, optional) – Date to assign to Hive partition. Default is today’s date, UTC.
  • num_mappers (Parameter, optional, insignificant) – The number of map tasks to ask Sqoop to use. Default is None.
  • overwrite (BoolParameter, optional, insignificant) – Whether or not to overwrite existing outputs; set to False by default for now.
  • pool (Parameter, optional, insignificant) – Default is None.
  • verbose (BoolParameter, optional, insignificant) – Print more information while working. Default is False.
  • where (Parameter, optional) – A “where” clause to be passed to Sqoop. Note that no spaces should be embedded and special characters should be escaped. For example: --where “id<50”. Default is None.
class edx.analytics.tasks.insights.database_imports.ImportEcommerceUser(*args, **kwargs)

Ecommerce: Users: Imports users from an external ecommerce table to a destination directory.

Parameters:
  • credentials (Parameter, configurable) – Path to the external access credentials file. Default is pulled from database-import.credentials.
  • database (Parameter, configurable) – Default is pulled from database-import.database.
  • destination (Parameter, configurable) – The directory to write the output files to. Default is pulled from database-import.destination.
  • import_date (DateParameter, optional) – Date to assign to Hive partition. Default is today’s date, UTC.
  • num_mappers (Parameter, optional, insignificant) – The number of map tasks to ask Sqoop to use. Default is None.
  • overwrite (BoolParameter, optional, insignificant) – Whether or not to overwrite existing outputs; set to False by default for now.
  • pool (Parameter, optional, insignificant) – Default is None.
  • verbose (BoolParameter, optional, insignificant) – Print more information while working. Default is False.
  • where (Parameter, optional) – A “where” clause to be passed to Sqoop. Note that no spaces should be embedded and special characters should be escaped. For example: --where “id<50”. Default is None.
class edx.analytics.tasks.insights.database_imports.ImportGeneratedCertificatesTask(*args, **kwargs)

Parameters:
  • credentials (Parameter, configurable) – Path to the external access credentials file. Default is pulled from database-import.credentials.
  • database (Parameter, configurable) – Default is pulled from database-import.database.
  • destination (Parameter, configurable) – The directory to write the output files to. Default is pulled from database-import.destination.
  • import_date (DateParameter, optional) – Date to assign to Hive partition. Default is today’s date, UTC.
  • num_mappers (Parameter, optional, insignificant) – The number of map tasks to ask Sqoop to use. Default is None.
  • overwrite (BoolParameter, optional, insignificant) – Whether or not to overwrite existing outputs; set to False by default for now.
  • pool (Parameter, optional, insignificant) – Default is None.
  • verbose (BoolParameter, optional, insignificant) – Print more information while working. Default is False.
  • where (Parameter, optional) – A “where” clause to be passed to Sqoop. Note that no spaces should be embedded and special characters should be escaped. For example: --where “id<50”. Default is None.
class edx.analytics.tasks.insights.database_imports.ImportIntoHiveTableTask(*args, **kwargs)

Abstract class to import data into a Hive table.

Requires four properties and a requires() method to be defined.

Parameters:
  • overwrite (BoolParameter, optional, insignificant) – Whether or not to overwrite existing outputs; set to False by default for now.
  • pool (Parameter, optional, insignificant) – Default is None.
class edx.analytics.tasks.insights.database_imports.ImportMysqlToHiveTableTask(*args, **kwargs)

Dumps data from an RDBMS table, and imports into Hive.

Requires override of table_name and columns properties.

Parameters:
  • credentials (Parameter, configurable) – Path to the external access credentials file. Default is pulled from database-import.credentials.
  • database (Parameter, configurable) – Default is pulled from database-import.database.
  • destination (Parameter, configurable) – The directory to write the output files to. Default is pulled from database-import.destination.
  • import_date (DateParameter, optional) – Date to assign to Hive partition. Default is today’s date, UTC.
  • num_mappers (Parameter, optional, insignificant) – The number of map tasks to ask Sqoop to use. Default is None.
  • overwrite (BoolParameter, optional, insignificant) – Whether or not to overwrite existing outputs; set to False by default for now.
  • pool (Parameter, optional, insignificant) – Default is None.
  • verbose (BoolParameter, optional, insignificant) – Print more information while working. Default is False.
  • where (Parameter, optional) – A “where” clause to be passed to Sqoop. Note that no spaces should be embedded and special characters should be escaped. For example: --where “id<50”. Default is None.
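The requirement to override table_name and columns in ImportMysqlToHiveTableTask can be pictured with a stand-in base class. The sketch below does not import the pipeline: MysqlToHiveSketch is a hypothetical stand-in for the real task, and the table name and the (name, type) column pairs are illustrative assumptions.

```python
class MysqlToHiveSketch(object):
    """Hypothetical stand-in: subclasses must supply table_name and columns."""

    @property
    def table_name(self):
        raise NotImplementedError("subclasses must override table_name")

    @property
    def columns(self):
        raise NotImplementedError("subclasses must override columns")

    def describe(self):
        """Render a one-line summary from the two overridden properties."""
        column_list = ", ".join("%s %s" % (name, kind) for name, kind in self.columns)
        return "%s(%s)" % (self.table_name, column_list)


class ExampleTableImport(MysqlToHiveSketch):
    """Illustrative subclass; the table and columns are made up."""

    @property
    def table_name(self):
        return "example_table"

    @property
    def columns(self):
        return [("id", "INT"), ("name", "STRING")]


print(ExampleTableImport().describe())  # example_table(id INT, name STRING)
```

Using the base class directly fails with NotImplementedError in this sketch, which mirrors the idea that the two properties are mandatory overrides.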
class edx.analytics.tasks.insights.database_imports.ImportPersistentCourseGradeTask(*args, **kwargs)

Imports the grades_persistentcoursegrade table to S3/Hive.

Parameters:
  • credentials (Parameter, configurable) – Path to the external access credentials file. Default is pulled from database-import.credentials.
  • database (Parameter, configurable) – Default is pulled from database-import.database.
  • destination (Parameter, configurable) – The directory to write the output files to. Default is pulled from database-import.destination.
  • import_date (DateParameter, optional) – Date to assign to Hive partition. Default is today’s date, UTC.
  • num_mappers (Parameter, optional, insignificant) – The number of map tasks to ask Sqoop to use. Default is None.
  • overwrite (BoolParameter, optional, insignificant) – Whether or not to overwrite existing outputs; set to False by default for now.
  • pool (Parameter, optional, insignificant) – Default is None.
  • verbose (BoolParameter, optional, insignificant) – Print more information while working. Default is False.
  • where (Parameter, optional) – A “where” clause to be passed to Sqoop. Note that no spaces should be embedded and special characters should be escaped. For example: --where “id<50”. Default is None.
class edx.analytics.tasks.insights.database_imports.ImportProductCatalog(*args, **kwargs)

Ecommerce: Products: Imports product catalog from an external ecommerce table to both a destination directory and a HIVE metastore.

Parameters:
  • credentials (Parameter, configurable) – Path to the external access credentials file. Default is pulled from database-import.credentials.
  • database (Parameter, configurable) – Default is pulled from database-import.database.
  • destination (Parameter, configurable) – The directory to write the output files to. Default is pulled from database-import.destination.
  • import_date (DateParameter, optional) – Date to assign to Hive partition. Default is today’s date, UTC.
  • num_mappers (Parameter, optional, insignificant) – The number of map tasks to ask Sqoop to use. Default is None.
  • overwrite (BoolParameter, optional, insignificant) – Whether or not to overwrite existing outputs; set to False by default for now.
  • pool (Parameter, optional, insignificant) – Default is None.
  • verbose (BoolParameter, optional, insignificant) – Print more information while working. Default is False.
  • where (Parameter, optional) – A “where” clause to be passed to Sqoop. Note that no spaces should be embedded and special characters should be escaped. For example: --where “id<50”. Default is None.
class edx.analytics.tasks.insights.database_imports.ImportProductCatalogAttributeValues(*args, **kwargs)

Ecommerce: Products: Imports product catalog attribute values from an external ecommerce table to both a destination directory and a HIVE metastore.

Parameters:
  • credentials (Parameter, configurable) – Path to the external access credentials file. Default is pulled from database-import.credentials.
  • database (Parameter, configurable) – Default is pulled from database-import.database.
  • destination (Parameter, configurable) – The directory to write the output files to. Default is pulled from database-import.destination.
  • import_date (DateParameter, optional) – Date to assign to Hive partition. Default is today’s date, UTC.
  • num_mappers (Parameter, optional, insignificant) – The number of map tasks to ask Sqoop to use. Default is None.
  • overwrite (BoolParameter, optional, insignificant) – Whether or not to overwrite existing outputs; set to False by default for now.
  • pool (Parameter, optional, insignificant) – Default is None.
  • verbose (BoolParameter, optional, insignificant) – Print more information while working. Default is False.
  • where (Parameter, optional) – A “where” clause to be passed to Sqoop. Note that no spaces should be embedded and special characters should be escaped. For example: --where “id<50”. Default is None.
class edx.analytics.tasks.insights.database_imports.ImportProductCatalogAttributes(*args, **kwargs)

Ecommerce: Products: Imports product catalog attributes from an external ecommerce table to both a destination directory and a HIVE metastore.

Parameters:
  • credentials (Parameter, configurable) – Path to the external access credentials file. Default is pulled from database-import.credentials.
  • database (Parameter, configurable) – Default is pulled from database-import.database.
  • destination (Parameter, configurable) – The directory to write the output files to. Default is pulled from database-import.destination.
  • import_date (DateParameter, optional) – Date to assign to Hive partition. Default is today’s date, UTC.
  • num_mappers (Parameter, optional, insignificant) – The number of map tasks to ask Sqoop to use. Default is None.
  • overwrite (BoolParameter, optional, insignificant) – Whether or not to overwrite existing outputs; set to False by default for now.
  • pool (Parameter, optional, insignificant) – Default is None.
  • verbose (BoolParameter, optional, insignificant) – Print more information while working. Default is False.
  • where (Parameter, optional) – A “where” clause to be passed to Sqoop. Note that no spaces should be embedded and special characters should be escaped. For example: --where “id<50”. Default is None.
class edx.analytics.tasks.insights.database_imports.ImportProductCatalogClass(*args, **kwargs)

Ecommerce: Products: Imports product catalog classes from an external ecommerce table to a destination directory.

Parameters:
  • credentials (Parameter, configurable) – Path to the external access credentials file. Default is pulled from database-import.credentials.
  • database (Parameter, configurable) – Default is pulled from database-import.database.
  • destination (Parameter, configurable) – The directory to write the output files to. Default is pulled from database-import.destination.
  • import_date (DateParameter, optional) – Date to assign to Hive partition. Default is today’s date, UTC.
  • num_mappers (Parameter, optional, insignificant) – The number of map tasks to ask Sqoop to use. Default is None.
  • overwrite (BoolParameter, optional, insignificant) – Whether or not to overwrite existing outputs; set to False by default for now.
  • pool (Parameter, optional, insignificant) – Default is None.
  • verbose (BoolParameter, optional, insignificant) – Print more information while working. Default is False.
  • where (Parameter, optional) – A “where” clause to be passed to Sqoop. Note that no spaces should be embedded and special characters should be escaped. For example: --where “id<50”. Default is None.
class edx.analytics.tasks.insights.database_imports.ImportShoppingCartCertificateItem(*args, **kwargs)

Imports certificate items from an external LMS DB shopping cart table to both a destination directory and a HIVE metastore.

Parameters:
  • credentials (Parameter, configurable) – Path to the external access credentials file. Default is pulled from database-import.credentials.
  • database (Parameter, configurable) – Default is pulled from database-import.database.
  • destination (Parameter, configurable) – The directory to write the output files to. Default is pulled from database-import.destination.
  • import_date (DateParameter, optional) – Date to assign to Hive partition. Default is today’s date, UTC.
  • num_mappers (Parameter, optional, insignificant) – The number of map tasks to ask Sqoop to use. Default is None.
  • overwrite (BoolParameter, optional, insignificant) – Whether or not to overwrite existing outputs; set to False by default for now.
  • pool (Parameter, optional, insignificant) – Default is None.
  • verbose (BoolParameter, optional, insignificant) – Print more information while working. Default is False.
  • where (Parameter, optional) – A “where” clause to be passed to Sqoop. Note that no spaces should be embedded and special characters should be escaped. For example: --where “id<50”. Default is None.
class edx.analytics.tasks.insights.database_imports.ImportShoppingCartCoupon(*args, **kwargs)

Imports coupon definitions from an external LMS DB shopping cart table to both a destination directory and a HIVE metastore.

Parameters:
  • credentials (Parameter, configurable) – Path to the external access credentials file. Default is pulled from database-import.credentials.
  • database (Parameter, configurable) – Default is pulled from database-import.database.
  • destination (Parameter, configurable) – The directory to write the output files to. Default is pulled from database-import.destination.
  • import_date (DateParameter, optional) – Date to assign to Hive partition. Default is today’s date, UTC.
  • num_mappers (Parameter, optional, insignificant) – The number of map tasks to ask Sqoop to use. Default is None.
  • overwrite (BoolParameter, optional, insignificant) – Whether or not to overwrite existing outputs; set to False by default for now.
  • pool (Parameter, optional, insignificant) – Default is None.
  • verbose (BoolParameter, optional, insignificant) – Print more information while working. Default is False.
  • where (Parameter, optional) – A “where” clause to be passed to Sqoop. Note that no spaces should be embedded and special characters should be escaped. For example: --where “id<50”. Default is None.
class edx.analytics.tasks.insights.database_imports.ImportShoppingCartCouponRedemption(*args, **kwargs)

Imports coupon redemptions from an external LMS DB shopping cart table to both a destination directory and a HIVE metastore.

Parameters:
  • credentials (Parameter, configurable) – Path to the external access credentials file. Default is pulled from database-import.credentials.
  • database (Parameter, configurable) – Default is pulled from database-import.database.
  • destination (Parameter, configurable) – The directory to write the output files to. Default is pulled from database-import.destination.
  • import_date (DateParameter, optional) – Date to assign to Hive partition. Default is today’s date, UTC.
  • num_mappers (Parameter, optional, insignificant) – The number of map tasks to ask Sqoop to use. Default is None.
  • overwrite (BoolParameter, optional, insignificant) – Whether or not to overwrite existing outputs; set to False by default for now.
  • pool (Parameter, optional, insignificant) – Default is None.
  • verbose (BoolParameter, optional, insignificant) – Print more information while working. Default is False.
  • where (Parameter, optional) – A “where” clause to be passed to Sqoop. Note that no spaces should be embedded and special characters should be escaped. For example: --where “id<50”. Default is None.
class edx.analytics.tasks.insights.database_imports.ImportShoppingCartCourseRegistrationCodeItem(*args, **kwargs)

Imports course registration codes from an external ecommerce table to both a destination directory and a HIVE metastore.

Parameters:
  • credentials (Parameter, configurable) – Path to the external access credentials file. Default is pulled from database-import.credentials.
  • database (Parameter, configurable) – Default is pulled from database-import.database.
  • destination (Parameter, configurable) – The directory to write the output files to. Default is pulled from database-import.destination.
  • import_date (DateParameter, optional) – Date to assign to Hive partition. Default is today’s date, UTC.
  • num_mappers (Parameter, optional, insignificant) – The number of map tasks to ask Sqoop to use. Default is None.
  • overwrite (BoolParameter, optional, insignificant) – Whether or not to overwrite existing outputs; set to False by default for now.
  • pool (Parameter, optional, insignificant) – Default is None.
  • verbose (BoolParameter, optional, insignificant) – Print more information while working. Default is False.
  • where (Parameter, optional) – A “where” clause to be passed to Sqoop. Note that no spaces should be embedded and special characters should be escaped. For example: --where “id<50”. Default is None.
class edx.analytics.tasks.insights.database_imports.ImportShoppingCartDonation(*args, **kwargs)

Imports donations from an external LMS DB shopping cart table to both a destination directory and a Hive metastore.

Parameters:
  • credentials (Parameter, configurable) – Path to the external access credentials file. Default is pulled from database-import.credentials.
  • database (Parameter, configurable) – Default is pulled from database-import.database.
  • destination (Parameter, configurable) – The directory to write the output files to. Default is pulled from database-import.destination.
  • import_date (DateParameter, optional) – Date to assign to Hive partition. Default is today’s date, UTC.
  • num_mappers (Parameter, optional, insignificant) – The number of map tasks to ask Sqoop to use. Default is None.
  • overwrite (BoolParameter, optional, insignificant) – Whether or not to overwrite existing outputs; set to False by default for now.
  • pool (Parameter, optional, insignificant) – Default is None.
  • verbose (BoolParameter, optional, insignificant) – Print more information while working. Default is False.
  • where (Parameter, optional) – A “where” clause to be passed to Sqoop. Note that no spaces should be embedded and special characters should be escaped. For example: --where "id<50". Default is None.
class edx.analytics.tasks.insights.database_imports.ImportShoppingCartOrder(*args, **kwargs)

Imports orders from an external LMS DB shopping cart table to both a destination directory and a Hive metastore.

Parameters:
  • credentials (Parameter, configurable) – Path to the external access credentials file. Default is pulled from database-import.credentials.
  • database (Parameter, configurable) – Default is pulled from database-import.database.
  • destination (Parameter, configurable) – The directory to write the output files to. Default is pulled from database-import.destination.
  • import_date (DateParameter, optional) – Date to assign to Hive partition. Default is today’s date, UTC.
  • num_mappers (Parameter, optional, insignificant) – The number of map tasks to ask Sqoop to use. Default is None.
  • overwrite (BoolParameter, optional, insignificant) – Whether or not to overwrite existing outputs; set to False by default for now.
  • pool (Parameter, optional, insignificant) – Default is None.
  • verbose (BoolParameter, optional, insignificant) – Print more information while working. Default is False.
  • where (Parameter, optional) – A “where” clause to be passed to Sqoop. Note that no spaces should be embedded and special characters should be escaped. For example: --where "id<50". Default is None.
class edx.analytics.tasks.insights.database_imports.ImportShoppingCartOrderItem(*args, **kwargs)

Imports individual order items from an external LMS DB shopping cart table to both a destination directory and a Hive metastore.

Parameters:
  • credentials (Parameter, configurable) – Path to the external access credentials file. Default is pulled from database-import.credentials.
  • database (Parameter, configurable) – Default is pulled from database-import.database.
  • destination (Parameter, configurable) – The directory to write the output files to. Default is pulled from database-import.destination.
  • import_date (DateParameter, optional) – Date to assign to Hive partition. Default is today’s date, UTC.
  • num_mappers (Parameter, optional, insignificant) – The number of map tasks to ask Sqoop to use. Default is None.
  • overwrite (BoolParameter, optional, insignificant) – Whether or not to overwrite existing outputs; set to False by default for now.
  • pool (Parameter, optional, insignificant) – Default is None.
  • verbose (BoolParameter, optional, insignificant) – Print more information while working. Default is False.
  • where (Parameter, optional) – A “where” clause to be passed to Sqoop. Note that no spaces should be embedded and special characters should be escaped. For example: --where "id<50". Default is None.
class edx.analytics.tasks.insights.database_imports.ImportShoppingCartPaidCourseRegistration(*args, **kwargs)

Imports paid course registrations from an external LMS DB shopping cart table to both a destination directory and a Hive metastore.

Parameters:
  • credentials (Parameter, configurable) – Path to the external access credentials file. Default is pulled from database-import.credentials.
  • database (Parameter, configurable) – Default is pulled from database-import.database.
  • destination (Parameter, configurable) – The directory to write the output files to. Default is pulled from database-import.destination.
  • import_date (DateParameter, optional) – Date to assign to Hive partition. Default is today’s date, UTC.
  • num_mappers (Parameter, optional, insignificant) – The number of map tasks to ask Sqoop to use. Default is None.
  • overwrite (BoolParameter, optional, insignificant) – Whether or not to overwrite existing outputs; set to False by default for now.
  • pool (Parameter, optional, insignificant) – Default is None.
  • verbose (BoolParameter, optional, insignificant) – Print more information while working. Default is False.
  • where (Parameter, optional) – A “where” clause to be passed to Sqoop. Note that no spaces should be embedded and special characters should be escaped. For example: --where "id<50". Default is None.
class edx.analytics.tasks.insights.database_imports.ImportStudentCourseEnrollmentTask(*args, **kwargs)

Imports course enrollment information from an external LMS DB to a destination directory.

Parameters:
  • credentials (Parameter, configurable) – Path to the external access credentials file. Default is pulled from database-import.credentials.
  • database (Parameter, configurable) – Default is pulled from database-import.database.
  • destination (Parameter, configurable) – The directory to write the output files to. Default is pulled from database-import.destination.
  • import_date (DateParameter, optional) – Date to assign to Hive partition. Default is today’s date, UTC.
  • num_mappers (Parameter, optional, insignificant) – The number of map tasks to ask Sqoop to use. Default is None.
  • overwrite (BoolParameter, optional, insignificant) – Whether or not to overwrite existing outputs; set to False by default for now.
  • pool (Parameter, optional, insignificant) – Default is None.
  • verbose (BoolParameter, optional, insignificant) – Print more information while working. Default is False.
  • where (Parameter, optional) – A “where” clause to be passed to Sqoop. Note that no spaces should be embedded and special characters should be escaped. For example: --where "id<50". Default is None.
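
The guidance on the where parameter shared by the import tasks above can be sketched as a small helper. This is an illustration only: build_where_args is a hypothetical name that does not appear in the pipeline, and the whitespace check simply mirrors the "no spaces" note rather than any validation the tasks are known to perform.

```python
import shlex


def build_where_args(where=None):
    """Return extra Sqoop arguments for an optional where clause.

    Hypothetical helper for illustration; the pipeline's own handling
    of this parameter may differ.
    """
    if where is None:
        return []
    if any(ch.isspace() for ch in where):
        raise ValueError("where clause must not contain whitespace: %r" % where)
    return ["--where", where]


args = build_where_args("id<50")

# When rendering a shell command line, quote each argument so that
# characters such as "<" are not interpreted by the shell.
command_line = " ".join(shlex.quote(arg) for arg in args)
print(command_line)
```

Passing the arguments as a list to a subprocess call avoids shell interpretation entirely; the quoting step matters only when the command is written out as a single string.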

util.hive

Various helper utilities that are commonly used when working with Hive.

class edx.analytics.tasks.util.hive.BareHiveTableTask(*args, **kwargs)

Abstract class that represents the metadata associated with a Hive table.

Note that this task only ensures that the table is created. It does not populate the table with any data; it simply runs the DDL commands to create the table.

Also note that it will not change the schema of the table if it already exists unless the overwrite parameter is set to True.

Parameters:
  • overwrite (BoolParameter, optional, insignificant) – Whether or not to overwrite existing outputs; set to False by default for now.
  • pool (Parameter, optional, insignificant) – Default is None.
  • warehouse_path (Parameter, configurable) – A URL location of the data warehouse. Default is pulled from hive.warehouse_path.
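
As a rough sketch of the behaviour described for BareHiveTableTask, the helper below builds DDL that creates a table only if it is missing and drops it first when overwrite is set. The exact statements the task issues are not shown in this reference, so the function name table_ddl, the use of an external table, and the example table, column, and location are all assumptions.

```python
def table_ddl(table, columns, location, overwrite=False):
    """Build DDL that ensures a table exists without loading any data.

    Hypothetical helper; the statements the real task runs may differ.
    """
    column_spec = ", ".join(
        "`%s` %s" % (name, col_type) for name, col_type in columns
    )
    statements = []
    if overwrite:
        # Dropping first is what lets a changed schema take effect.
        statements.append("DROP TABLE IF EXISTS `%s`;" % table)
    statements.append(
        "CREATE EXTERNAL TABLE IF NOT EXISTS `%s` (%s) LOCATION '%s';"
        % (table, column_spec, location)
    )
    return statements


# Example values are placeholders, not names used by the pipeline.
for statement in table_ddl(
    "example_table",
    [("course_id", "STRING"), ("user_id", "INT")],
    "s3://example-bucket/warehouse/example_table",
    overwrite=True,
):
    print(statement)
```

Without the drop step, CREATE ... IF NOT EXISTS leaves an existing table untouched, which matches the note that the schema is not changed unless overwrite is True.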
class edx.analytics.tasks.util.hive.HivePartitionTask(*args, **kwargs)

Abstract class that represents the metadata associated with a partition in a Hive table.

Note that this task only ensures that the partition is created. It does not populate the partition with any data; it simply runs the DDL commands to create the partition.

Parameters:
  • overwrite (BoolParameter, optional, insignificant) – Whether or not to overwrite existing outputs; set to False by default for now.
  • partition_value (Parameter) –
  • pool (Parameter, optional, insignificant) – Default is None.
  • warehouse_path (Parameter, configurable) – A URL location of the data warehouse. Default is pulled from hive.warehouse_path.
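
The partition DDL mentioned for HivePartitionTask might look like the sketch below. The helper name partition_ddl, the partition key dt, and the layout of partition directories under warehouse_path are assumptions made for illustration; only the ALTER TABLE ... ADD IF NOT EXISTS PARTITION form is standard Hive syntax.

```python
def partition_ddl(table, partition_key, partition_value, warehouse_path):
    """Build DDL that registers a partition without loading data into it.

    Hypothetical helper; assumes partitions live in key=value directories
    under <warehouse_path>/<table>.
    """
    table_location = warehouse_path.rstrip("/") + "/" + table
    partition_location = "%s/%s=%s" % (
        table_location, partition_key, partition_value
    )
    return (
        "ALTER TABLE `%s` ADD IF NOT EXISTS PARTITION (%s='%s') LOCATION '%s';"
        % (table, partition_key, partition_value, partition_location)
    )


# Example values are placeholders, not names used by the pipeline.
print(partition_ddl("example_table", "dt", "2014-01-01",
                    "s3://example-bucket/warehouse/"))
```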
class edx.analytics.tasks.util.hive.HiveTableFromQueryTask(*args, **kwargs)

Creates a hive table from the results of a hive query.

Parameters:
  • overwrite (BoolParameter, optional, insignificant) – Whether or not to overwrite existing outputs; set to False by default for now.
  • pool (Parameter, optional, insignificant) – Default is None.
  • warehouse_path (Parameter, configurable) – A URL location of the data warehouse. Default is pulled from hive.warehouse_path.
class edx.analytics.tasks.util.hive.HiveTableTask(*args, **kwargs)

Abstract class to import data into a Hive table.

Currently supports a single partition that represents the version of the table data. This allows us to use a consistent location for the table and swap out the data in the tables by simply pointing at different partitions within the folder that contain different “versions” of the table data. For example, if a snapshot is taken of an RDBMS table, we might store that in a partition with today’s date. Any subsequent jobs that need to join against that table will continue to use the data snapshot from the beginning of the day (since that is the “live” partition). However, the next time a snapshot is taken a new partition is created and loaded and becomes the “live” partition that is used in all joins etc.

Important note: this code does not currently clean up old, unused partitions. They continue to exist until some external process removes them.

Parameters:
  • overwrite (BoolParameter, optional, insignificant) – Whether or not to overwrite existing outputs; set to False by default for now.
  • pool (Parameter, optional, insignificant) – Default is None.
  • warehouse_path (Parameter, configurable) – A URL location of the data warehouse. Default is pulled from hive.warehouse_path.
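
The versioned-partition scheme described for HiveTableTask can be illustrated with a small sketch. It assumes snapshots are keyed by date under a partition key named dt and that the newest snapshot is the "live" one; the function names, the key, and the example warehouse and table names are hypothetical.

```python
import datetime


def partition_location(warehouse_path, table, snapshot_date, key="dt"):
    """Return the directory holding one dated version of a table's data."""
    return "{}/{}/{}={}/".format(
        warehouse_path.rstrip("/"), table, key, snapshot_date.isoformat()
    )


def live_partition(snapshot_dates):
    """Pick the most recent snapshot, treated here as the live partition."""
    return max(snapshot_dates)


# Two snapshots of the same table; the older one is kept, not deleted.
snapshots = [datetime.date(2014, 1, 1), datetime.date(2014, 1, 2)]
live = live_partition(snapshots)
print(partition_location("hdfs://warehouse", "example_table", live))
```

Because each snapshot gets its own directory, loading a new version never touches the data that running jobs are reading, at the cost of the old directories accumulating as noted above.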
class edx.analytics.tasks.util.hive.OverwriteAwareHiveQueryDataTask(*args, **kwargs)

A generalized Data task whose output is a hive table populated from a hive query.

Parameters:
  • overwrite (BoolParameter, optional, insignificant) – Whether or not to overwrite existing outputs; set to False by default for now.
  • overwrite_target_partition (BoolParameter, optional, insignificant) – Overwrite the target partition, deleting any existing data. This will not impact other partitions. Do not use with incrementally built partitions. Default is True.
  • pool (Parameter, optional, insignificant) – Default is None.
  • warehouse_path (Parameter, configurable) – A URL location of the data warehouse. Default is pulled from hive.warehouse_path.

util.url

Support for URLs. Specifically, we want to be able to refer to data stored in a variety of locations and formats using a standard URL syntax.

Examples:

s3://some-bucket/path/to/file
/path/to/local/file.gz
hdfs://some/directory/
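
A minimal sketch of how the example URLs above break down, using only the standard library. Treating a URL without a scheme as a local file is an assumption made for this illustration, and describe_url is a hypothetical name, not a function from this module.

```python
from urllib.parse import urlparse


def describe_url(url):
    """Split a URL into (scheme, location, path).

    A missing scheme is reported as "file" here, on the assumption that
    bare paths refer to the local filesystem.
    """
    parsed = urlparse(url)
    scheme = parsed.scheme or "file"
    return scheme, parsed.netloc, parsed.path


for example in [
    "s3://some-bucket/path/to/file",
    "/path/to/local/file.gz",
    "hdfs://some/directory/",
]:
    print(describe_url(example))
# ('s3', 'some-bucket', '/path/to/file')
# ('file', '', '/path/to/local/file.gz')
# ('hdfs', 'some', '/directory/')
```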
class edx.analytics.tasks.util.url.ExternalURL(*args, **kwargs)

Simple Task that returns a target based on its URL.

Parameters:
  • url (Parameter) –
class edx.analytics.tasks.util.url.UncheckedExternalURL(*args, **kwargs)

An ExternalURL task that does not verify whether the source file exists, a check that can be expensive for S3 URLs.

Parameters:
  • url (Parameter) –