Validation Rules

Comprehensive validation requirements for DOCiD metadata schema compliance

Core Property Validation

Required Field Validation
[Critical] All publications

Essential fields that must be present and valid for all publications:

Field | Validation Rule | Error Message | Example
document_title | Not null, 1-255 characters, UTF-8 | "Title is required and cannot exceed 255 characters" | "Traditional Medicine in the Eastern Cape"
document_docid | Unique, format: DOCID.{INST}.{YEAR}.{NUM} | "DocID must follow format DOCID.INST.YEAR.NUM" | "DOCID.UCT.2024.001"
doi | Valid DOI format, 10.xxxxx/xxxxx | "DOI must be valid format: 10.xxxxx/xxxxx" | "10.5555/docid.2024.001"
resource_type_id | Must exist in resource_types table | "Invalid resource type ID" | 1 (Dataset)
published | Valid datetime, not future unless embargoed | "Publication date cannot be in future unless embargoed" | "2024-03-15T10:00:00Z"
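
The rules in the table can be sketched as one function. This is a minimal illustration, not the DOCiD implementation: the name validate_required_fields is hypothetical, the two regular expressions are copied from the identifier table later in this document, `published` is assumed to be an already-parsed timezone-aware datetime, and the database-backed checks (DocID uniqueness, resource_type_id lookup) are left out.

```python
import re
from datetime import datetime, timezone

# Patterns copied from the identifier table in this document.
DOCID_PATTERN = re.compile(r'^DOCID\.[A-Z]+\.\d{4}\.\d+$')
DOI_PATTERN = re.compile(r'^10\.\d+/.+$')


def validate_required_fields(publication, now=None):
    """Return a list of error messages for the required core fields.

    Hypothetical helper: field names follow the table above.
    """
    errors = []
    now = now or datetime.now(timezone.utc)

    title = publication.get('document_title') or ''
    if not 1 <= len(title) <= 255:
        errors.append("Title is required and cannot exceed 255 characters")

    if not DOCID_PATTERN.match(publication.get('document_docid') or ''):
        errors.append("DocID must follow format DOCID.INST.YEAR.NUM")

    if not DOI_PATTERN.match(publication.get('doi') or ''):
        errors.append("DOI must be valid format: 10.xxxxx/xxxxx")

    # Assumption: 'published' is a timezone-aware datetime or None.
    published = publication.get('published')
    if published is None:
        errors.append("Published date is required")
    elif published > now and not publication.get('embargo_until'):
        errors.append("Publication date cannot be in future unless embargoed")

    return errors
```

Collecting messages in a list instead of raising on the first failure matches the other validators in this document, which all return a list of errors.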

Creator Validation Rules

Creator Information Validation
[Required] PublicationCreators model

Validation rules for creator and contributor information:

Creator Validation Logic
import re

# CreatorRole is assumed to be the application's model for the creators_roles table.

def validate_creator(creator_data):
    """
    Validate creator information according to DOCiD standards.
    """
    errors = []
    
    # Name validation
    if not creator_data.get('given_name') and not creator_data.get('family_name'):
        errors.append("Either given_name or family_name is required")
    
    if creator_data.get('given_name') and len(creator_data['given_name']) > 100:
        errors.append("Given name cannot exceed 100 characters")
        
    if creator_data.get('family_name') and len(creator_data['family_name']) > 100:
        errors.append("Family name cannot exceed 100 characters")
    
    # ORCID validation
    if creator_data.get('orcid'):
        if not re.match(r'^https://orcid\.org/\d{4}-\d{4}-\d{4}-\d{3}[\dX]$', 
                       creator_data['orcid']):
            errors.append("ORCID must be valid format: https://orcid.org/0000-0000-0000-0000")
    
    # Affiliation validation
    if creator_data.get('affiliation'):
        if len(creator_data['affiliation']) > 255:
            errors.append("Affiliation cannot exceed 255 characters")
    
    # Role validation
    if creator_data.get('role_id'):
        if not CreatorRole.query.get(creator_data['role_id']):
            errors.append("Invalid creator role ID")
    
    return errors
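
The regular expression above checks only the shape of an ORCID iD. ORCID iDs also carry a check character computed with ISO 7064 MOD 11-2, so a stricter validator could verify it. The sketch below is an optional addition, and the function name orcid_checksum_is_valid is hypothetical.

```python
def orcid_checksum_is_valid(orcid_url):
    """Verify the ISO 7064 MOD 11-2 check character of an ORCID iD.

    Accepts either the full URL form or the bare 0000-0000-0000-0000 form.
    """
    digits = orcid_url.rsplit('/', 1)[-1].replace('-', '')
    if len(digits) != 16:
        return False

    total = 0
    for ch in digits[:15]:
        if ch not in '0123456789':
            return False
        total = (total + int(ch)) * 2

    result = (12 - total % 11) % 11
    expected = 'X' if result == 10 else str(result)
    return digits[15] == expected
```

Running the format check first and the checksum second keeps the error messages distinct: one for a malformed value, one for a well-formed value with a mistyped digit.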

Creator Validation Requirements:

  • Name: At least one of given_name or family_name required
  • ORCID: Valid ORCID URL format if provided
  • Affiliation: Maximum 255 characters
  • Role: Must exist in creators_roles controlled vocabulary
  • Email: Valid email format if provided
  • Order: Unique ordering within publication
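
The Email and Order requirements are not covered by validate_creator above. A minimal sketch of both checks follows. The key names `email` and `order`, the function name validate_creator_list, and the deliberately permissive email pattern are assumptions made for illustration.

```python
import re
from collections import Counter

# Permissive shape check (something@domain.tld); not a full RFC 5322 validator.
EMAIL_PATTERN = re.compile(r'^[^@\s]+@[^@\s]+\.[^@\s]+$')


def validate_creator_list(creators):
    """Check email format per creator and order uniqueness across the list."""
    errors = []

    for index, creator in enumerate(creators):
        email = creator.get('email')
        if email and not EMAIL_PATTERN.match(email):
            errors.append(f"Creator {index}: invalid email format")

    # Order values must be unique within one publication.
    order_counts = Counter(
        c['order'] for c in creators if c.get('order') is not None
    )
    for order, count in sorted(order_counts.items()):
        if count > 1:
            errors.append(f"Creator order {order} is used {count} times")

    return errors
```

Order uniqueness has to be checked across the whole creator list, which is why it cannot live inside a per-creator function.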

Organization Validation Rules

Publisher/Organization Validation
[Required] PublicationOrganization model

Validation for publisher and organizational affiliations:

Organization Validation Logic
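import re

# verify_ror_id and ISO_COUNTRY_CODES are assumed to be defined elsewhere in the application.

def validate_organization(org_data):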
    """
    Validate organization information including ROR integration.
    """
    errors = []
    
    # Name validation
    if not org_data.get('name'):
        errors.append("Organization name is required")
    elif len(org_data['name']) > 255:
        errors.append("Organization name cannot exceed 255 characters")
    
    # Type validation
    if not org_data.get('type'):
        errors.append("Organization type is required")
    elif org_data['type'] not in VALID_ORG_TYPES:
        errors.append(f"Invalid organization type: {org_data['type']}")
    
    # ROR ID validation
    if org_data.get('identifier') and org_data.get('identifier_type') == 'ror':
        if not re.match(r'^https://ror\.org/[a-z0-9]+$', org_data['identifier']):
            errors.append("ROR ID must be valid format: https://ror.org/xxxxxxx")
        # Query the ROR registry (API call) only when the format is valid
        elif not verify_ror_id(org_data['identifier']):
            errors.append("ROR ID not found in ROR registry")
    
    # Country validation
    if org_data.get('country'):
        if org_data['country'] not in ISO_COUNTRY_CODES:
            errors.append("Country must use ISO 3166-1 alpha-2 code")
    
    return errors

VALID_ORG_TYPES = [
    'University', 'Research Institute', 'Government', 'NGO', 
    'Company', 'Foundation', 'Network', 'Other'
]

Organization Validation Requirements:

  • Name: Required, maximum 255 characters
  • Type: Must use controlled vocabulary
  • ROR ID: Valid ROR format and verified against ROR API
  • Country: ISO 3166-1 alpha-2 country codes
  • Uniqueness: Same organization cannot be listed multiple times
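
The Uniqueness requirement is not covered by validate_organization, because it concerns the whole list and not a single entry. One possible sketch is shown below. The function name find_duplicate_organizations is hypothetical, and so is the matching rule: it compares by identifier when one is present and by case-insensitive name otherwise.

```python
def find_duplicate_organizations(organizations):
    """Return the names of organizations that repeat an earlier entry.

    Assumption: two entries are the same organization when their identifiers
    match, or, if no identifier is given, when their names match ignoring
    case and surrounding whitespace.
    """
    seen = set()
    duplicates = []

    for org in organizations:
        key = (org.get('identifier') or org.get('name') or '').strip().casefold()
        if not key:
            continue  # nothing to compare on
        if key in seen:
            duplicates.append(org.get('name'))
        else:
            seen.add(key)

    return duplicates
```

A limitation of this rule: an entry with an identifier and a second entry for the same organization without one are not matched against each other.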

Identifier Validation Rules

Persistent Identifier Validation
[Critical] Multiple identifier types

Validation rules for different types of persistent identifiers:

Identifier Type | Format | Validation Pattern | Example
DOI | 10.xxxxx/xxxxx | ^10\.\d+/.+$ | 10.5555/docid.2024.001
Handle | hdl:prefix/suffix or URL | ^(hdl:|https://hdl\.handle\.net/).+$ | https://hdl.handle.net/20.500.12345/123
DocID | DOCID.INST.YEAR.NUM | ^DOCID\.[A-Z]+\.\d{4}\.\d+$ | DOCID.UCT.2024.001
CSTR | CSTR:xxxxx.xx.xxxx-xxxx | ^CSTR:\d+\.\d+\.[A-Z0-9]+-\d+$ | CSTR:16389.09.A08V-0001
URN | urn:namespace:identifier | ^urn:[a-z0-9][a-z0-9-]*:.+$ | urn:nbn:za:uct-123456
URL | Valid HTTP/HTTPS URL | ^https?://[^\s]+$ | https://example.org/dataset/123

Identifier Validation Implementation
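import re

# validate_doi_resolution and validate_docid_uniqueness are assumed to be
# application helpers that return a (bool, message) tuple.

def validate_identifier(identifier_value, identifier_type):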
    """
    Validate different types of persistent identifiers.
    """
    patterns = {
        'DOI': r'^10\.\d+/.+$',
        'Handle': r'^(hdl:|https://hdl\.handle\.net/).+$',
        'DocID': r'^DOCID\.[A-Z]+\.\d{4}\.\d+$',
        'CSTR': r'^CSTR:\d+\.\d+\.[A-Z0-9]+-\d+$',
        'URN': r'^urn:[a-z0-9][a-z0-9-]*:.+$',
        'URL': r'^https?://[^\s]+$'
    }
    
    if identifier_type not in patterns:
        return False, f"Unknown identifier type: {identifier_type}"
    
    pattern = patterns[identifier_type]
    if not re.match(pattern, identifier_value):
        return False, f"Invalid {identifier_type} format"
    
    # Additional validations
    if identifier_type == 'DOI':
        # Check DOI resolution (optional)
        return validate_doi_resolution(identifier_value)
    elif identifier_type == 'DocID':
        # Check uniqueness in database
        return validate_docid_uniqueness(identifier_value)
    
    return True, "Valid identifier"
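
As a quick sanity check, the patterns can be run against the examples from the identifier table. The sketch below covers only the format step, without the DOI resolution or DocID uniqueness lookups. The names PATTERNS, EXAMPLES, and matches_format are illustrative.

```python
import re

# Same patterns as in validate_identifier above.
PATTERNS = {
    'DOI': r'^10\.\d+/.+$',
    'Handle': r'^(hdl:|https://hdl\.handle\.net/).+$',
    'DocID': r'^DOCID\.[A-Z]+\.\d{4}\.\d+$',
    'CSTR': r'^CSTR:\d+\.\d+\.[A-Z0-9]+-\d+$',
    'URN': r'^urn:[a-z0-9][a-z0-9-]*:.+$',
    'URL': r'^https?://[^\s]+$',
}

# Example values taken from the identifier table.
EXAMPLES = {
    'DOI': '10.5555/docid.2024.001',
    'Handle': 'https://hdl.handle.net/20.500.12345/123',
    'DocID': 'DOCID.UCT.2024.001',
    'CSTR': 'CSTR:16389.09.A08V-0001',
    'URN': 'urn:nbn:za:uct-123456',
    'URL': 'https://example.org/dataset/123',
}


def matches_format(value, identifier_type):
    """Return True when value matches the pattern for identifier_type."""
    pattern = PATTERNS.get(identifier_type)
    return bool(pattern and re.match(pattern, value))


for id_type, example in EXAMPLES.items():
    print(id_type, matches_format(example, id_type))
```

The patterns are case-sensitive, and the DocID pattern accepts only uppercase letters in the institution code, so a code containing digits or lowercase letters is rejected.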

Date and Time Validation

Publication Date Validation
[Required] DateTime fields

Comprehensive validation for date and time fields:

Date Validation Logic
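from datetime import datetime, timezone

# parse_datetime is assumed to be an application helper that returns a
# timezone-aware datetime and raises ValueError on malformed input.

def validate_publication_dates(publication_data):
    """
    Validate all date fields in publication.
    """
    errors = []
    # Timezone-aware "now": comparable with parsed values such as "2024-03-15T10:00:00Z"
    current_time = datetime.now(timezone.utc)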
    
    # Required published date
    if not publication_data.get('published'):
        errors.append("Published date is required")
        return errors
    
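    try:
        published = parse_datetime(publication_data['published'])
    except ValueError:
        errors.append("Published date must be a valid ISO 8601 datetime")
        return errors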
    
    # Publication year extraction
    publication_year = published.year
    if publication_year < 1000 or publication_year > current_time.year + 10:
        errors.append(f"Publication year {publication_year} is out of valid range")
    
    # Embargo validation
    embargo_until = publication_data.get('embargo_until')
    if embargo_until:
        embargo_date = parse_datetime(embargo_until)
        
        if embargo_date <= current_time:
            errors.append("Embargo end date must be in the future")
        
        if embargo_date >= published:
            errors.append("Embargo end date must be before publication date")
    
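    # Created vs published date consistency. Checked only when created_at is
    # supplied: defaulting to the current time would reject every past date.
    created_at = publication_data.get('created_at')
    if created_at is not None and published < created_at:
        errors.append("Published date cannot be before creation date")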
    
    # Future publication validation
    if published > current_time and not embargo_until:
        errors.append("Future publication dates require embargo information")
    
    return errors

Date Validation Rules:

  • Published Date: Required, ISO 8601 format
  • Year Range: Between 1000 and current year + 10
  • Embargo Logic: End date must be future, before publication
  • Chronological Order: Created ≤ Published, and Embargo End < Published
  • Future Dates: Require embargo justification
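
The validation code calls parse_datetime, which is not defined in this section. One possible implementation, using only the standard library, is sketched below. Treating naive input as UTC is an assumption made here, not a documented DOCiD rule.

```python
from datetime import datetime, timezone


def parse_datetime(value):
    """Parse an ISO 8601 string into a timezone-aware UTC datetime.

    Raises ValueError on malformed input.
    """
    if isinstance(value, datetime):
        parsed = value
    else:
        text = value.strip()
        # datetime.fromisoformat() accepts a trailing 'Z' only from Python 3.11 on.
        if text.endswith('Z'):
            text = text[:-1] + '+00:00'
        parsed = datetime.fromisoformat(text)

    if parsed.tzinfo is None:
        # Assumption: values without an offset are already in UTC.
        parsed = parsed.replace(tzinfo=timezone.utc)
    return parsed.astimezone(timezone.utc)
```

Returning timezone-aware values throughout avoids the TypeError that Python raises when a naive datetime is compared with an aware one.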

Language and Content Validation

Multilingual Content Validation
[Optional] ISO 639-1 codes

Validation for multilingual content and language codes:

Language Validation Example
def validate_language_content(metadata):
    """
    Validate language codes and multilingual content.
    """
    errors = []
    
    # ISO 639-1 codes used in the African context, plus 'nso' (Northern Sotho),
    # an ISO 639-2/3 code, because that language has no two-letter code
    VALID_LANGUAGE_CODES = [
        'en', 'af', 'zu', 'xh', 'sw', 'ha', 'yo', 'am', 'ar', 'fr', 'pt',
        'st', 'tn', 've', 'ts', 'ss', 'nr', 'nso', 'lg', 'ak', 'tw'
    ]
    
    # Validate titles with language codes
    if 'titles' in metadata:
        for title in metadata['titles']:
            if 'lang' in title:
                if title['lang'] not in VALID_LANGUAGE_CODES:
                    errors.append(f"Invalid language code: {title['lang']}")
            
            if 'titleType' in title:
                valid_types = ['MainTitle', 'Subtitle', 'AlternativeTitle', 
                              'TranslatedTitle', 'Other']
                if title['titleType'] not in valid_types:
                    errors.append(f"Invalid title type: {title['titleType']}")
    
    # Validate abstract languages
    if 'abstracts' in metadata:
        for abstract in metadata['abstracts']:
            if 'lang' in abstract and abstract['lang'] not in VALID_LANGUAGE_CODES:
                errors.append(f"Invalid abstract language code: {abstract['lang']}")
    
    return errors

Language Validation Requirements:

  • Language Codes: ISO 639-1 two-letter codes, with a three-letter ISO 639-2/3 code (nso) where no two-letter code exists
  • African Languages: Extended support for African language codes
  • UTF-8 Encoding: All text content must be valid UTF-8
  • Title Types: Controlled vocabulary for title classifications
  • Consistency: Language codes consistent across related fields
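
The UTF-8 requirement is not checked by validate_language_content. A minimal sketch of such a check follows. The function name is_valid_utf8 is hypothetical, and it accepts both raw bytes (for example, an uploaded payload) and already-decoded strings.

```python
def is_valid_utf8(value):
    """Return True when value is, or can be encoded as, valid UTF-8."""
    if isinstance(value, bytes):
        try:
            value.decode('utf-8')
        except UnicodeDecodeError:
            return False
        return True

    try:
        # A Python str can still hold lone surrogates, which UTF-8 cannot encode.
        value.encode('utf-8')
    except UnicodeEncodeError:
        return False
    return True
```

In practice this check belongs at the point where request bodies are decoded, before any of the field-level validators run.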

Cultural Protocol Validation

Traditional Knowledge Validation
[Conditional] Cultural protocols

Validation rules for cultural protocols and traditional knowledge:

Cultural Protocol Validation
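from datetime import datetime, timezone

# parse_datetime is assumed to be an application helper that raises
# ValueError on malformed input.

def validate_cultural_protocols(metadata):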
    """
    Validate cultural protocols and traditional knowledge labels.
    """
    errors = []
    
    if 'cultural_protocols' not in metadata:
        return errors
    
    protocols = metadata['cultural_protocols']
    
    # Traditional Knowledge validation
    if 'traditional_knowledge' in protocols:
        tk = protocols['traditional_knowledge']
        
        if tk.get('has_tk') and not tk.get('tk_labels'):
            errors.append("TK labels required when has_tk is true")
        
        if tk.get('tk_labels'):
            valid_labels = ['TK A', 'TK NC', 'TK CO', 'TK CS', 'TK CB', 'TK CL']
            for label in tk['tk_labels']:
                if label.get('code') not in valid_labels:
                    errors.append(f"Invalid TK label code: {label.get('code')}")
                
                if not label.get('community'):
                    errors.append("Community name required for TK labels")
    
    # Community consent validation
    if 'community_consent' in protocols:
        consent = protocols['community_consent']
        
        if consent.get('consent_obtained') and not consent.get('consenting_authority'):
            errors.append("Consenting authority required when consent obtained")
        
        if consent.get('consent_date'):
            try:
                consent_date = parse_datetime(consent['consent_date'])
                # Timezone-aware "now"; needs `from datetime import datetime, timezone`
                if consent_date > datetime.now(timezone.utc):
                    errors.append("Consent date cannot be in the future")
            except ValueError:
                errors.append("Invalid consent date format")
    
    return errors

Cultural Protocol Requirements:

  • TK Labels: Valid Local Contexts TK label codes
  • Community Information: Community name required for TK labels
  • Consent Documentation: Consent authority and date validation
  • Access Restrictions: Consistent with publication visibility
  • Benefit Sharing: Agreement details when applicable

Error Handling and Reporting

Validation Error Response Format
[Standard] API responses

Standardized error response format for validation failures:

Validation Error Response
POST /api/v1/publications/publish
Content-Type: application/json

{
    "document_title": "",  // Invalid: empty title
    "document_docid": "INVALID-FORMAT",  // Invalid: wrong format
    "creators": []  // Invalid: no creators
}

Response (400 Bad Request):
{
    "status": "error",
    "message": "Validation failed",
    "error_code": "VALIDATION_ERROR",
    "validation_errors": [
        {
            "field": "document_title",
            "code": "REQUIRED_FIELD",
            "message": "Title is required and cannot be empty",
            "value": ""
        },
        {
            "field": "document_docid",
            "code": "INVALID_FORMAT", 
            "message": "DocID must follow format DOCID.INST.YEAR.NUM",
            "value": "INVALID-FORMAT",
            "expected_format": "DOCID.{INSTITUTION}.{YEAR}.{NUMBER}"
        },
        {
            "field": "creators",
            "code": "MINIMUM_REQUIRED",
            "message": "At least one creator is required",
            "minimum": 1,
            "provided": 0
        }
    ],
    "request_id": "req_123456789",
    "timestamp": "2024-08-15T14:30:00Z"
}

Error Response Components:

  • Field-Specific Errors: Each validation error tied to specific field
  • Error Codes: Machine-readable error classification
  • Human-Readable Messages: Clear explanations for users
  • Expected Values: Guidance on correct formats
  • Request Tracking: Unique request ID for debugging
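
The response shape above can be assembled from per-field errors with two small helpers. This is a sketch under assumptions: the function names field_error and build_validation_error_response are hypothetical, and generating the request ID with uuid is a placeholder for whatever request tracking the API actually uses.

```python
import uuid
from datetime import datetime, timezone


def field_error(field, code, message, **details):
    """Build one entry for the validation_errors list.

    Extra keyword arguments (value, expected_format, minimum, ...) are
    copied into the entry unchanged.
    """
    error = {"field": field, "code": code, "message": message}
    error.update(details)
    return error


def build_validation_error_response(field_errors, request_id=None):
    """Wrap field errors in the standard 400 response body."""
    return {
        "status": "error",
        "message": "Validation failed",
        "error_code": "VALIDATION_ERROR",
        "validation_errors": list(field_errors),
        # Placeholder ID scheme; a real API would reuse its own request ID.
        "request_id": request_id or f"req_{uuid.uuid4().hex}",
        "timestamp": datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ"),
    }


response = build_validation_error_response([
    field_error("document_title", "REQUIRED_FIELD",
                "Title is required and cannot be empty", value=""),
    field_error("creators", "MINIMUM_REQUIRED",
                "At least one creator is required", minimum=1, provided=0),
])
```

Keeping the machine-readable `code` separate from the human-readable `message` lets clients branch on the code while displaying the message unchanged.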