Strategies for Testing Data Warehouse Applications
Testing Goals
- Data completeness.
Ensures that all expected data is loaded. - Data transformation.
Ensures that all data is transformed correctly according to business rules and/or design specifications. - Data quality.
Ensures that the ETL application correctly rejects, substitutes default values, corrects or ignores and reports invalid data. - Performance and scalability.
Ensures that data loads and queries perform within expected time frames and that the technical architecture is scalable. - Integration testing.
Ensures that the ETL process functions well with other upstream and downstream processes. - User-acceptance testing.
Ensures the solution meets users' current expectations and anticipates their future expectations. - Regression testing.
Ensures existing functionality remains intact each time a new release of code is completed.
Data Completeness
Preconditions.
1. Objects are deinstalled successfully.
2. Objects are installed successfully.
3. -init scripts have been fulfilled.
4. the n.a., n.d. and initial data is in dimensions, facts are empty.
DIM. Comparing record counts between source and target data.
Script:
SELECT COUNT ( * ) AS ERRORS
FROM ( SELECT COUNT ( * ) AS cnt
FROM wrk_table_contacts
MINUS
SELECT COUNT ( * ) AS cnt
FROM mv_table_for_dataload );
Strategy:
1. Run loading process.
2. Run loading process again.
3. Test target minus source.
4. Test source minus target.
DIM. Table is loaded with default IDs.
Script:
-- Verify total rows
SELECT CASE WHEN SUM ( 1 ) >= 2 THEN 'OK' ELSE 'ERROR' END AS is_filled
FROM ( SELECT column_id AS total
FROM dim_table WHERE column_id IN ('-98', '-99');
DIM. The same script but sometimes it's needed to exclude default values.
SELECT COUNT ( * ) AS ERRORS
FROM ( SELECT table_channel_code AS cnt
FROM dim_table_channels
WHERE table_channel_id NOT IN (-98, -99)
MINUS
SELECT DISTINCT table_channel AS cnt
FROM wrk_table_contacts
WHERE table_channel NOT IN ('n.a', 'n.d.') );
DIM. Populating the full contents of each field to validate that no truncation occurs at any step in the process.
Script:
SELECT COUNT ( * ) AS ERRORS
FROM ( SELECT user_id_ticket
, user_id
, table_theme_id
, table_channel
, table_channel_id
, table_status
FROM wrk_table_contacts
MINUS
SELECT user_id_ticket
, user_id
, table_theme_id
, table_channel
, table_channel_id
, table_status
FROM mv_table_for_dataload );
FCT. Table is loaded with dimension IDs.
Script:
-- Verify total rows
SELECT CASE WHEN SUM ( 1 ) > 0 THEN 'OK' ELSE 'ERROR' END AS is_filled
FROM ( SELECT column_with_id AS total
FROM fct_table WHERE column_with_id NOT IN ('-98', '-99');
FCT. Comparing record counts between source and target data.
Script:
-- Verify total count of tickets
SELECT COUNT ( * ) AS ERRORS
FROM ( SELECT COUNT ( table_srcid ) AS cnt
FROM mv_table_for_dataload
MINUS
SELECT COUNT ( table_srcid ) AS cnt
FROM fct_table_contacts );
FCT. Find lost data.
-- Verify total count of lost data
SELECT COUNT ( * ) AS LOST
FROM mv_table_for_dataload a
, fct_table_contacts b
WHERE a.table_srcid = b.table_srcid(+)
AND a.contact_srcid = b.contact_srcid(+)
AND b.contact_srcid IS NULL );
Data Transformation
Example 1. Validate that data types in the warehouse are as specified in the design and/or the data model.
Manually.
Strategy:
1. Review ETL code.
2. Wait couple of days.
3. Review ETL code again. If there are defects then repeat from 1 point.
Example 2. SCD transformations. Find wrong flags.
Test 1:
SELECT COUNT ( * ) AS ERRORS
FROM dim_table_contact_actions_scd
WHERE valid_to = TO_DATE ( '31.12.9999'
, 'DD.MM.YYYY' )
AND active_flag = 'N'
AND table_contact_action_surr_id NOT IN (-98, -99);
Test 2:
SELECT COUNT ( * ) AS ERRORS
FROM(SELECT table_contact_action_surr_id, table_contact_action_id, valid_from, valid_to, active_flag cnt
FROM dim_table_contact_actions_scd
WHERE valid_to != TO_DATE ( '31.12.9999'
, 'DD.MM.YYYY' )
AND active_flag = 'Y'
AND table_contact_action_surr_id NOT IN (-98, -99));
Example 3. SCD transformations. Find lost time frames in dimension. Usefull test, use it! Service team will be happy.
Script:
SELECT SUM ( CASE WHEN valid_to <> NVL ( valid_from_next, '31.12.9999' ) THEN '1' ELSE '0' END ) AS ERRORS
FROM ( SELECT table_contact_action_surr_id
, table_contact_action_id
, valid_to
, LEAD ( valid_from ) OVER (PARTITION BY table_contact_action_id ORDER BY valid_from ASC)
AS valid_from_next
FROM dim_table_contact_actions_scd
WHERE table_contact_action_id NOT IN (-99, -98) ) );
Example 4. SCD transformations. Verify time filter.
Manually.
...
AND table.event_dt >= user.valid_from(+)
AND table.event_dt < user.valid_to(+) --> if (+) then verify that there is NVL.
...
Don't use BETWEEN function in this case!!!
Data Quality
Example 1. Verify all dimension IDs in fact table, every dim_id.
Script:
SELECT COUNT ( * ) AS ERRORS
FROM ( SELECT table_surr_id
FROM fct_table_contacts
MINUS
SELECT table_surr_id
FROM dim_table_scd );
Example 2. Verify all level codes in the filters.
...
FROM DIM_table_SCD prod,
...
Verify filter -> WHERE prod.level_code = 'USER'
Example 3. Verify comments in code.
...
WHERE system_id = 1
...
Verify filter -> -- source system
Example 4. Verify comments of db objects.
TableStructure.sql - generate the html file with table structure and comments.
Performance and scalability
Example 1.
DataAmount.sql - shows the size and growth index of fact table.
Example 2. Analyse log table.
SELECT *
FROM lib.vw_log_t
WHERE SYSTEM like '%Project Title%'
AND timestamp > TO_DATE('10.05.2010', 'DD.MM.YYYY')
ORDER BY timestamp DESC;
Regression Testing
Building automation during system testing will make the process of regression testing much smoother.
Fitnesse is a good point.