Validation Rules
Comprehensive validation requirements for DOCiD metadata schema compliance
Core Property Validation
Required Field Validation
Essential fields that must be present and valid for all publications:
Field | Validation Rule | Error Message | Example |
---|---|---|---|
document_title | Not null, 1-255 characters, UTF-8 | "Title is required and cannot exceed 255 characters" | "Traditional Medicine in the Eastern Cape" |
document_docid | Unique, format: DOCID.{INST}.{YEAR}.{NUM} | "DocID must follow format DOCID.INST.YEAR.NUM" | "DOCID.UCT.2024.001" |
doi | Valid DOI format, 10.xxxxx/xxxxx | "DOI must be valid format: 10.xxxxx/xxxxx" | "10.5555/docid.2024.001" |
resource_type_id | Must exist in resource_types table | "Invalid resource type ID" | 1 (Dataset) |
published | Valid datetime, not future unless embargoed | "Publication date cannot be in future unless embargoed" | "2024-03-15T10:00:00Z" |
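The rules for the three string fields in the table can be sketched as one function. This is a minimal illustration, not the DOCiD implementation: the name validate_required_fields is hypothetical, the patterns are taken from the formats shown in this document, and the database-backed checks (DocID uniqueness, resource_type_id lookup) and the published-date check are left out.

```python
import re

# Patterns follow the formats given in this document.
DOCID_PATTERN = re.compile(r'^DOCID\.[A-Z]+\.\d{4}\.\d+$')
DOI_PATTERN = re.compile(r'^10\.\d+/.+$')


def validate_required_fields(data):
    """Return error messages for the core string fields (hypothetical helper)."""
    errors = []

    # Title: present, non-empty, at most 255 characters
    title = data.get('document_title')
    if not title or len(title) > 255:
        errors.append("Title is required and cannot exceed 255 characters")

    # DocID: format only; uniqueness needs a database lookup
    docid = data.get('document_docid')
    if not docid or not DOCID_PATTERN.match(docid):
        errors.append("DocID must follow format DOCID.INST.YEAR.NUM")

    # DOI: prefix "10.", registrant digits, slash, non-empty suffix
    doi = data.get('doi')
    if not doi or not DOI_PATTERN.match(doi):
        errors.append("DOI must be valid format: 10.xxxxx/xxxxx")

    return errors
```

Returning a list of messages rather than raising on the first failure matches the style of the validators later in this document, so a caller can report every problem in one response.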
Creator Validation Rules
Creator Information Validation
Validation rules for creator and contributor information:
Creator Validation Logic
import re

# CreatorRole is the application's role model, defined elsewhere.

def validate_creator(creator_data):
    """
    Validate creator information according to DOCiD standards.
    """
    errors = []

    # Name validation: at least one name part, each at most 100 characters
    if not creator_data.get('given_name') and not creator_data.get('family_name'):
        errors.append("Either given_name or family_name is required")
    if creator_data.get('given_name') and len(creator_data['given_name']) > 100:
        errors.append("Given name cannot exceed 100 characters")
    if creator_data.get('family_name') and len(creator_data['family_name']) > 100:
        errors.append("Family name cannot exceed 100 characters")

    # ORCID validation: full URL form; the final character may be the checksum "X"
    if creator_data.get('orcid'):
        if not re.match(r'^https://orcid\.org/\d{4}-\d{4}-\d{4}-\d{3}[\dX]$',
                        creator_data['orcid']):
            errors.append("ORCID must be valid format: https://orcid.org/0000-0000-0000-0000")

    # Affiliation validation
    if creator_data.get('affiliation'):
        if len(creator_data['affiliation']) > 255:
            errors.append("Affiliation cannot exceed 255 characters")

    # Role validation: the ID must exist in the roles table
    if creator_data.get('role_id'):
        if not CreatorRole.query.get(creator_data['role_id']):
            errors.append("Invalid creator role ID")

    return errors
Creator Validation Requirements:
- Name: At least one of given_name or family_name required
- ORCID: Valid ORCID URL format if provided
- Affiliation: Maximum 255 characters
- Role: Must exist in creators_roles controlled vocabulary
- Email: Valid email format if provided
- Order: Unique ordering within publication
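The email and ordering requirements in the list above are not covered by validate_creator. A minimal sketch of how they might be checked follows; the field names email and order, the function name, and the deliberately loose email pattern are all assumptions made for illustration.

```python
import re

# Deliberately loose pattern: one "@", no whitespace, a dot in the domain part.
EMAIL_PATTERN = re.compile(r'^[^@\s]+@[^@\s]+\.[^@\s]+$')


def validate_creator_extras(creators):
    """Check email format and unique ordering across one publication's creators."""
    errors = []
    seen_orders = set()
    for index, creator in enumerate(creators):
        # Email is optional; validate only when provided
        email = creator.get('email')
        if email and not EMAIL_PATTERN.match(email):
            errors.append(f"Creator {index}: invalid email format")

        # Order values must not repeat within the publication
        order = creator.get('order')
        if order is not None:
            if order in seen_orders:
                errors.append(f"Creator {index}: duplicate order value {order}")
            seen_orders.add(order)
    return errors
```

Ordering is checked across the whole creator list because uniqueness cannot be decided from a single creator record.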
Organization Validation Rules
Publisher/Organization Validation
Validation for publisher and organizational affiliations:
Organization Validation Logic
import re

# verify_ror_id and ISO_COUNTRY_CODES are defined elsewhere in the application.

VALID_ORG_TYPES = [
    'University', 'Research Institute', 'Government', 'NGO',
    'Company', 'Foundation', 'Network', 'Other'
]

def validate_organization(org_data):
    """
    Validate organization information including ROR integration.
    """
    errors = []

    # Name validation
    if not org_data.get('name'):
        errors.append("Organization name is required")
    elif len(org_data['name']) > 255:
        errors.append("Organization name cannot exceed 255 characters")

    # Type validation
    if not org_data.get('type'):
        errors.append("Organization type is required")
    elif org_data['type'] not in VALID_ORG_TYPES:
        errors.append(f"Invalid organization type: {org_data['type']}")

    # ROR ID validation
    if org_data.get('identifier') and org_data.get('identifier_type') == 'ror':
        if not re.match(r'^https://ror\.org/[a-z0-9]+$', org_data['identifier']):
            errors.append("ROR ID must be valid format: https://ror.org/xxxxxxx")
        # Verify ROR ID exists (API call); skipped when the format is already invalid
        elif not verify_ror_id(org_data['identifier']):
            errors.append("ROR ID not found in ROR registry")

    # Country validation
    if org_data.get('country'):
        if org_data['country'] not in ISO_COUNTRY_CODES:
            errors.append("Country must use ISO 3166-1 alpha-2 code")

    return errors
Organization Validation Requirements:
- Name: Required, maximum 255 characters
- Type: Must use controlled vocabulary
- ROR ID: Valid ROR format and verified against ROR API
- Country: ISO 3166-1 alpha-2 country codes
- Uniqueness: Same organization cannot be listed multiple times
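The uniqueness requirement applies to a publication's whole organization list, so it cannot be checked inside validate_organization. One possible approach is sketched below. The function name and the choice of comparison key (identifier when present, otherwise a case-insensitive name) are assumptions; an entry listed once with an identifier and once without would not be caught by this simple key.

```python
def find_duplicate_organizations(organizations):
    """Return error messages for organizations listed more than once."""
    errors = []
    seen = set()
    for org in organizations:
        # Prefer the identifier as the key; fall back to a normalized name.
        key = org.get('identifier') or (org.get('name') or '').strip().casefold()
        if not key:
            continue  # a missing name is reported by the per-organization validator
        if key in seen:
            errors.append(f"Duplicate organization: {org.get('name') or key}")
        seen.add(key)
    return errors
```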
Identifier Validation Rules
Persistent Identifier Validation
Validation rules for different types of persistent identifiers:
Identifier Type | Format | Validation Pattern | Example |
---|---|---|---|
DOI | 10.xxxxx/xxxxx | ^10\.\d+/.+$ | 10.5555/docid.2024.001 |
Handle | hdl:prefix/suffix or URL | ^(hdl:|https://hdl\.handle\.net/).+$ | https://hdl.handle.net/20.500.12345/123 |
DocID | DOCID.INST.YEAR.NUM | ^DOCID\.[A-Z]+\.\d{4}\.\d+$ | DOCID.UCT.2024.001 |
CSTR | CSTR:xxxxx.xx.xxxx-xxxx | ^CSTR:\d+\.\d+\.[A-Z0-9]+-\d+$ | CSTR:16389.09.A08V-0001 |
URN | urn:namespace:identifier | ^urn:[a-z0-9][a-z0-9-]*:.+$ | urn:nbn:za:uct-123456 |
URL | Valid HTTP/HTTPS URL | ^https?://[^\s]+$ | https://example.org/dataset/123 |
Identifier Validation Implementation
import re

# validate_doi_resolution and validate_docid_uniqueness are defined elsewhere;
# each returns an (is_valid, message) tuple like this function does.

def validate_identifier(identifier_value, identifier_type):
    """
    Validate different types of persistent identifiers.
    """
    patterns = {
        'DOI': r'^10\.\d+/.+$',
        'Handle': r'^(hdl:|https://hdl\.handle\.net/).+$',
        'DocID': r'^DOCID\.[A-Z]+\.\d{4}\.\d+$',
        'CSTR': r'^CSTR:\d+\.\d+\.[A-Z0-9]+-\d+$',
        'URN': r'^urn:[a-z0-9][a-z0-9-]*:.+$',
        'URL': r'^https?://[^\s]+$'
    }

    if identifier_type not in patterns:
        return False, f"Unknown identifier type: {identifier_type}"

    pattern = patterns[identifier_type]
    if not re.match(pattern, identifier_value):
        return False, f"Invalid {identifier_type} format"

    # Additional validations
    if identifier_type == 'DOI':
        # Check DOI resolution (optional)
        return validate_doi_resolution(identifier_value)
    elif identifier_type == 'DocID':
        # Check uniqueness in database
        return validate_docid_uniqueness(identifier_value)

    return True, "Valid identifier"
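As a quick sanity check, the patterns can be run against the example values from the identifier table. The sketch below is self-contained and skips the resolution and uniqueness helpers; the names EXAMPLES and check_examples are hypothetical.

```python
import re

# Patterns and example values copied from the identifier table.
EXAMPLES = {
    'DOI': (r'^10\.\d+/.+$', '10.5555/docid.2024.001'),
    'Handle': (r'^(hdl:|https://hdl\.handle\.net/).+$',
               'https://hdl.handle.net/20.500.12345/123'),
    'DocID': (r'^DOCID\.[A-Z]+\.\d{4}\.\d+$', 'DOCID.UCT.2024.001'),
    'CSTR': (r'^CSTR:\d+\.\d+\.[A-Z0-9]+-\d+$', 'CSTR:16389.09.A08V-0001'),
    'URN': (r'^urn:[a-z0-9][a-z0-9-]*:.+$', 'urn:nbn:za:uct-123456'),
    'URL': (r'^https?://[^\s]+$', 'https://example.org/dataset/123'),
}


def check_examples(examples=EXAMPLES):
    """Return the identifier types whose example does not match its pattern."""
    return [name for name, (pattern, value) in examples.items()
            if not re.match(pattern, value)]


print(check_examples())  # an empty list means every example matches
```

Note that the patterns are case-sensitive: a DocID with a lowercase institution code or a URN written with an uppercase prefix would be rejected as written.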
Date and Time Validation
Publication Date Validation
Comprehensive validation for date and time fields:
Date Validation Logic
from datetime import datetime

# parse_datetime is a project helper defined elsewhere. It is assumed to return
# naive UTC datetimes so results can be compared with datetime.utcnow(), and to
# raise ValueError on malformed input.

def validate_publication_dates(publication_data):
    """
    Validate all date fields in publication.
    """
    errors = []
    current_time = datetime.utcnow()

    # Required published date
    if not publication_data.get('published'):
        errors.append("Published date is required")
        return errors

    try:
        published = parse_datetime(publication_data['published'])
    except ValueError:
        errors.append("Invalid published date format")
        return errors

    # Publication year range check
    publication_year = published.year
    if publication_year < 1000 or publication_year > current_time.year + 10:
        errors.append(f"Publication year {publication_year} is out of valid range")

    # Embargo validation
    embargo_until = publication_data.get('embargo_until')
    if embargo_until:
        embargo_date = parse_datetime(embargo_until)
        if embargo_date <= current_time:
            errors.append("Embargo end date must be in the future")
        if embargo_date >= published:
            errors.append("Embargo end date must be before publication date")

    # Created vs published date consistency. Skipped when created_at is absent:
    # defaulting it to the current time would reject every past publication date.
    created_at = publication_data.get('created_at')
    if created_at is not None and published < created_at:
        errors.append("Published date cannot be before creation date")

    # Future publication validation
    if published > current_time and not embargo_until:
        errors.append("Future publication dates require embargo information")

    return errors
Date Validation Rules:
- Published Date: Required, ISO 8601 format
- Year Range: Between 1000 and current year + 10
- Embargo Logic: End date must be future, before publication
- Chronological Order: Created ≤ Published; an embargo end date, when set, must fall before the published date
- Future Dates: Require embargo justification
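The date validation code calls parse_datetime without defining it. One possible implementation is sketched below, assuming (the source does not say) that the helper should return naive datetimes expressed in UTC so they compare cleanly with datetime.utcnow(), and that it raises ValueError on malformed input.

```python
from datetime import datetime, timezone


def parse_datetime(value):
    """Parse an ISO 8601 string into a naive datetime expressed in UTC."""
    # datetime.fromisoformat() accepts a trailing "Z" only from Python 3.11 on,
    # so normalize it to an explicit offset first.
    if value.endswith('Z'):
        value = value[:-1] + '+00:00'
    parsed = datetime.fromisoformat(value)  # raises ValueError on bad input
    if parsed.tzinfo is not None:
        # Convert to UTC, then drop tzinfo so the result can be compared
        # with other naive UTC datetimes.
        parsed = parsed.astimezone(timezone.utc).replace(tzinfo=None)
    return parsed
```

Mixing timezone-aware and naive datetimes in a comparison raises TypeError in Python, so whichever convention is chosen should be applied to every date the validators compare.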
Language and Content Validation
Multilingual Content Validation
Validation for multilingual content and language codes:
Language Validation Example
def validate_language_content(metadata):
    """
    Validate language codes and multilingual content.
    """
    errors = []

    # ISO 639-1 codes used in the African context, plus 'nso' (Northern Sotho),
    # an ISO 639-2 code for a language that has no two-letter code
    VALID_LANGUAGE_CODES = [
        'en', 'af', 'zu', 'xh', 'sw', 'ha', 'yo', 'am', 'ar', 'fr', 'pt',
        'st', 'tn', 've', 'ts', 'ss', 'nr', 'nso', 'lg', 'ak', 'tw'
    ]

    # Validate titles with language codes
    if 'titles' in metadata:
        for title in metadata['titles']:
            if 'lang' in title:
                if title['lang'] not in VALID_LANGUAGE_CODES:
                    errors.append(f"Invalid language code: {title['lang']}")
            if 'titleType' in title:
                valid_types = ['MainTitle', 'Subtitle', 'AlternativeTitle',
                               'TranslatedTitle', 'Other']
                if title['titleType'] not in valid_types:
                    errors.append(f"Invalid title type: {title['titleType']}")

    # Validate abstract languages
    if 'abstracts' in metadata:
        for abstract in metadata['abstracts']:
            if 'lang' in abstract and abstract['lang'] not in VALID_LANGUAGE_CODES:
                errors.append(f"Invalid abstract language code: {abstract['lang']}")

    return errors
Language Validation Requirements:
- Language Codes: ISO 639-1 two-letter codes, plus the ISO 639-2 code nso for Northern Sotho, which has no two-letter code
- African Languages: Extended support for African language codes
- UTF-8 Encoding: All text content must be valid UTF-8
- Title Types: Controlled vocabulary for title classifications
- Consistency: Language codes consistent across related fields
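The UTF-8 requirement in the list above can be checked wherever text is still available as raw bytes, for example an uploaded file or an undecoded request body. A minimal sketch, with hypothetical function names and the assumption that fields arrive as a mapping of name to bytes:

```python
def is_valid_utf8(raw):
    """Return True if the bytes decode as strict UTF-8."""
    try:
        raw.decode('utf-8')  # strict error handling is the default
    except UnicodeDecodeError:
        return False
    return True


def find_invalid_text_fields(fields):
    """Return the names of fields whose raw bytes are not valid UTF-8."""
    return [name for name, raw in fields.items() if not is_valid_utf8(raw)]
```

Once a value has been decoded into a Python str this check no longer applies, so it belongs at the point where bytes enter the system.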
Cultural Protocol Validation
Traditional Knowledge Validation
Validation rules for cultural protocols and traditional knowledge:
Cultural Protocol Validation
from datetime import datetime

# parse_datetime is a project helper defined elsewhere.

def validate_cultural_protocols(metadata):
    """
    Validate cultural protocols and traditional knowledge labels.
    """
    errors = []

    if 'cultural_protocols' not in metadata:
        return errors

    protocols = metadata['cultural_protocols']

    # Traditional Knowledge validation
    if 'traditional_knowledge' in protocols:
        tk = protocols['traditional_knowledge']
        if tk.get('has_tk') and not tk.get('tk_labels'):
            errors.append("TK labels required when has_tk is true")
        if tk.get('tk_labels'):
            valid_labels = ['TK A', 'TK NC', 'TK CO', 'TK CS', 'TK CB', 'TK CL']
            for label in tk['tk_labels']:
                if label.get('code') not in valid_labels:
                    errors.append(f"Invalid TK label code: {label.get('code')}")
                if not label.get('community'):
                    errors.append("Community name required for TK labels")

    # Community consent validation
    if 'community_consent' in protocols:
        consent = protocols['community_consent']
        if consent.get('consent_obtained') and not consent.get('consenting_authority'):
            errors.append("Consenting authority required when consent obtained")
        if consent.get('consent_date'):
            try:
                consent_date = parse_datetime(consent['consent_date'])
                if consent_date > datetime.utcnow():
                    errors.append("Consent date cannot be in the future")
            except ValueError:
                errors.append("Invalid consent date format")

    return errors
Cultural Protocol Requirements:
- TK Labels: Valid Local Contexts TK label codes
- Community Information: Community name required for TK labels
- Consent Documentation: Consent authority and date validation
- Access Restrictions: Consistent with publication visibility
- Benefit Sharing: Agreement details when applicable
Error Handling and Reporting
Validation Error Response Format
Standardized error response format for validation failures:
Validation Error Response
POST /api/v1/publications/publish
Content-Type: application/json

{
  "document_title": "",                // Invalid: empty title
  "document_docid": "INVALID-FORMAT",  // Invalid: wrong format
  "creators": []                       // Invalid: no creators
}

Response (400 Bad Request):

{
  "status": "error",
  "message": "Validation failed",
  "error_code": "VALIDATION_ERROR",
  "validation_errors": [
    {
      "field": "document_title",
      "code": "REQUIRED_FIELD",
      "message": "Title is required and cannot be empty",
      "value": ""
    },
    {
      "field": "document_docid",
      "code": "INVALID_FORMAT",
      "message": "DocID must follow format DOCID.INST.YEAR.NUM",
      "value": "INVALID-FORMAT",
      "expected_format": "DOCID.{INSTITUTION}.{YEAR}.{NUMBER}"
    },
    {
      "field": "creators",
      "code": "MINIMUM_REQUIRED",
      "message": "At least one creator is required",
      "minimum": 1,
      "provided": 0
    }
  ],
  "request_id": "req_123456789",
  "timestamp": "2024-08-15T14:30:00Z"
}
Error Response Components:
- Field-Specific Errors: Each validation error tied to specific field
- Error Codes: Machine-readable error classification
- Human-Readable Messages: Clear explanations for users
- Expected Values: Guidance on correct formats
- Request Tracking: Unique request ID for debugging
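A response body in this shape could be assembled with two small helpers. This is a sketch under stated assumptions: both function names are hypothetical, and the request ID scheme is invented for illustration because the source shows only an example value.

```python
from datetime import datetime, timezone
import uuid


def field_error(field, code, message, value=None, **details):
    """Build one entry for the validation_errors list."""
    error = {"field": field, "code": code, "message": message}
    if value is not None:
        error["value"] = value
    error.update(details)  # e.g. expected_format, minimum, provided
    return error


def build_validation_error_response(validation_errors, request_id=None):
    """Assemble a 400 response body in the format shown in this section."""
    return {
        "status": "error",
        "message": "Validation failed",
        "error_code": "VALIDATION_ERROR",
        "validation_errors": list(validation_errors),
        # Assumed ID scheme: "req_" plus a random hex string.
        "request_id": request_id or f"req_{uuid.uuid4().hex}",
        "timestamp": datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ"),
    }
```

For example, the DocID error from the sample response could be produced with field_error("document_docid", "INVALID_FORMAT", "DocID must follow format DOCID.INST.YEAR.NUM", value="INVALID-FORMAT", expected_format="DOCID.{INSTITUTION}.{YEAR}.{NUMBER}").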