Dear Developers,
at dCache we try to implement native CTA driver by following the official docs at Archival - EOSCTA Docs . We are able to send the archive request and even see files being written to virtual tape, including corresponding data transfer tom the tape mover. However, the DB tables are empty. Moreover, we don’t see status update callback from CTA. Any thoughts?
Thanks in advance,
Tigran.
Hi Tigran,
Without the details of what you have changed it is difficult to answer your question. I can only say that the tape server daemon calls the cta::ArchiveMount::reportJobsBatchTransferred()
method to insert one row into the TAPE_FILE
table for each file in a batch of files that it has just written to tape. Do you know if this method is being called?
Regards,
Steve
Hi Steve,
The hart of the archive methods looks like this:
try {
uint64_t archiveFileId = m_scheduler->checkAndGetNextArchiveFileId(instance, storageClass, requester, *m_log);
cta::common::dataStructures::ArchiveRequest archiveRequest;
cta::checksum::ProtobufToChecksumBlob(request->file().csb(), archiveRequest.checksumBlob);
archiveRequest.diskFileInfo.owner_uid = 0;
archiveRequest.diskFileInfo.gid = 0;
archiveRequest.diskFileInfo.path = "";
archiveRequest.diskFileID = request->file().fid();
archiveRequest.fileSize = request->file().size();
archiveRequest.requester.name = request->cli().user().username();
archiveRequest.requester.group = request->cli().user().groupname();
// The original code
// archiveRequest.srcURL = request->wf().instance().url();
archiveRequest.storageClass = storageClass;
archiveRequest.srcURL = request->transport().dst_url();
archiveRequest.archiveReportURL = request->transport().report_url();
archiveRequest.archiveErrorReportURL = request->transport().error_report_url();
archiveRequest.creationLog.host = context->peer();
archiveRequest.creationLog.username = instance;
archiveRequest.creationLog.time = time(nullptr);
std::string reqId = m_scheduler->queueArchiveWithGivenId(archiveFileId, instance, archiveRequest, *m_log);
m_log->log(cta::log::INFO, "Archive request for storageClass: " + storageClass
+ " archiveFileId: " + std::to_string(archiveFileId)
+ "RequestID: " + reqId);
response->set_fid(archiveFileId);
response->set_reqid(reqId);
} catch (cta::exception::Exception &ex) {
m_log->log(cta::log::CRIT, ex.getMessageValue());
return ::grpc::Status(::grpc::StatusCode::INTERNAL, ex.getMessageValue());
}
which is efficiently the merged version of CREATE and CLOSEW methods.
Thanks,
Tigran.
After a deeper dive into the code, it looks like that Scheduler#getNextArchiveJobsToReportBatch
returns an empty list. No idea why.
tigran
14 October 2021 16:05
5
This is what we see in the log file:
Oct 14 17:54:04 zitosm-devel01.desy.de cta-taped[7898]: LVL="DEBUG" PID="7898" TID="8034" MSG="In MigrationReportPacker::reportDriveStatus(), pushing a report." thread="TapeWrite" tapeDrive="VLSTK10" tapeVid="V01002" mountId="25" type="ReportDriveStatus" Status="Unmount"
Oct 14 17:54:05 zitosm-devel01.desy.de cta-taped[7898]: LVL="DEBUG" PID="7898" TID="8034" MSG="In MigrationReportPacker::reportDriveStatus(), pushing a report." thread="TapeWrite" tapeDrive="VLSTK10" tapeVid="V01002" mountId="25" type="ReportDriveStatus" Status="Free"
Oct 14 17:54:05 zitosm-devel01.desy.de cta-taped[7898]: LVL="INFO" PID="7898" TID="8034" MSG="TapeWriteSingleThread : tape unmounted" thread="TapeWrite" tapeDrive="VLSTK10" tapeVid="V01002" mountId="25"
Oct 14 17:54:05 zitosm-devel01.desy.de cta-taped[7898]: LVL="INFO" PID="7898" TID="8034" MSG="Tape thread complete" thread="TapeWrite" tapeDrive="VLSTK10" tapeVid="V01002" mountId="25" type="write" mountTime="0.118913" positionTime="0.012067" waitInstructionsTime="0.110644" checksumingTime="0.036722" readWriteTime="0.572365" waitDataTime="30.784628" waitReportingTime="0.000340" flushTime="0.007768" unloadTime="0.017616" unmountTime="1.094139" encryptionControlTime="0.003525" transferTime="31.512467" totalTime="32.648351" dataVolume="74880409" headerVolume="480" files="1" payloadTransferSpeedMBps="2.293543" driveTransferSpeedMBps="2.293558" status="success"
Oct 14 17:54:05 zitosm-devel01.desy.de cta-taped[7898]: LVL="DEBUG" PID="7898" TID="8034" MSG="In MigrationReportPacker::reportEndOfSession(), pushing a report." thread="TapeWrite" tapeDrive="VLSTK10" tapeVid="V01002" mountId="25" type="ReportEndofSession" mountTime="0.118913" positionTime="0.012067" waitInstructionsTime="0.110644" checksumingTime="0.036722" readWriteTime="0.572365" waitDataTime="30.784628" waitReportingTime="0.000340" flushTime="0.007768" unloadTime="0.017616" unmountTime="1.094139" encryptionControlTime="0.003525" transferTime="31.512467" totalTime="32.648351" dataVolume="74880409" headerVolume="480" files="1" payloadTransferSpeedMBps="2.293543" driveTransferSpeedMBps="2.293558" status="success"
Oct 14 17:54:05 zitosm-devel01.desy.de cta-taped[7898]: LVL="DEBUG" PID="7898" TID="7898" MSG="Deleted threads in DiskReadThreadPool::~DiskReadThreadPool" thread="MainThread" tapeDrive="VLSTK10" tapeVid="V01002" mountId="25" threadID="9"
Oct 14 17:54:05 zitosm-devel01.desy.de cta-taped[7898]: LVL="INFO" PID="7898" TID="7898" MSG="MigrationMemoryManager destruction : all memory blocks have been deleted" thread="MainThread" tapeDrive="VLSTK10" tapeVid="V01002" mountId="25"
Oct 14 17:54:05 zitosm-devel01.desy.de cta-taped[7898]: LVL="INFO" PID="7898" TID="7898" MSG="In Agent::removeAndUnregisterSelf(): Removed agent object." agentObject="DriveProcess-VLSTK10-zitosm-devel01.desy.de-7898-20211014-17:52:32-0"
Oct 14 17:54:05 zitosm-devel01.desy.de cta-taped[7898]: LVL="ERROR" PID="7898" TID="7898" MSG="Aborting cta-taped on uncaught exception. Stack trace follows." Message="Uncaught exception of type 'std::logic_error' in Thread.run(): >>>><<<< End of uncaught exception"
Oct 14 17:54:05 zitosm-devel01.desy.de cta-taped[7898]: LVL="ERROR" PID="7898" TID="7898" MSG="Stack trace" traceFrameNumber="0" traceFrame="/lib64/libctacommon.so.0(cta::exception::Backtrace::Backtrace(bool)+0x6e) [0x7fbb5a4345fa]"
Oct 14 17:54:05 zitosm-devel01.desy.de cta-taped[7898]: LVL="ERROR" PID="7898" TID="7898" MSG="Stack trace" traceFrameNumber="1" traceFrame="/lib64/libctacommon.so.0(cta::exception::Exception::Exception(std::string const&, bool)+0x89) [0x7fbb5a435e91]"
Oct 14 17:54:05 zitosm-devel01.desy.de cta-taped[7898]: LVL="ERROR" PID="7898" TID="7898" MSG="Stack trace" traceFrameNumber="2" traceFrame="/lib64/libctacommon.so.0(cta::threading::UncaughtExceptionInThread::UncaughtExceptionInThread(std::string const&)+0x37) [0x7fbb5a450787]"
Oct 14 17:54:05 zitosm-devel01.desy.de cta-taped[7898]: LVL="ERROR" PID="7898" TID="7898" MSG="Stack trace" traceFrameNumber="3" traceFrame="/lib64/libctacommon.so.0(cta::threading::Thread::wait()+0x13b) [0x7fbb5a45009d]"
Oct 14 17:54:05 zitosm-devel01.desy.de cta-taped[7898]: LVL="ERROR" PID="7898" TID="7898" MSG="Stack trace" traceFrameNumber="4" traceFrame="cta-taped(castor::tape::tapeserver::daemon::TaskWatchDog::stopAndWaitThread()+0x39) [0x59497f]"
Oct 14 17:54:05 zitosm-devel01.desy.de cta-taped[7898]: LVL="ERROR" PID="7898" TID="7898" MSG="Stack trace" traceFrameNumber="5" traceFrame="cta-taped(castor::tape::tapeserver::daemon::DataTransferSession::executeWrite(cta::log::LogContext&, cta::ArchiveMount*)+0x66a) [0x58e6b4]"
Oct 14 17:54:05 zitosm-devel01.desy.de cta-taped[7898]: LVL="ERROR" PID="7898" TID="7898" MSG="Stack trace" traceFrameNumber="6" traceFrame="cta-taped(castor::tape::tapeserver::daemon::DataTransferSession::execute()+0xbd3) [0x58c605]"
Oct 14 17:54:05 zitosm-devel01.desy.de cta-taped[7898]: LVL="ERROR" PID="7898" TID="7898" MSG="Stack trace" traceFrameNumber="7" traceFrame="cta-taped(cta::tape::daemon::DriveHandler::runChild()+0x178b) [0x5503df]"
Oct 14 17:54:05 zitosm-devel01.desy.de cta-taped[7898]: LVL="ERROR" PID="7898" TID="7898" MSG="Stack trace" traceFrameNumber="8" traceFrame="cta-taped(cta::tape::daemon::ProcessManager::runForkManagement()+0x3dd) [0x572a7b]"
Oct 14 17:54:05 zitosm-devel01.desy.de cta-taped[7898]: LVL="ERROR" PID="7898" TID="7898" MSG="Stack trace" traceFrameNumber="9" traceFrame="cta-taped(cta::tape::daemon::ProcessManager::run()+0x7e) [0x571a74]"
Oct 14 17:54:05 zitosm-devel01.desy.de cta-taped[7898]: LVL="ERROR" PID="7898" TID="7898" MSG="Stack trace" traceFrameNumber="10" traceFrame="cta-taped(cta::tape::daemon::TapeDaemon::mainEventLoop()+0x2fb) [0x53bea9]"
Oct 14 17:54:05 zitosm-devel01.desy.de cta-taped[7898]: LVL="ERROR" PID="7898" TID="7898" MSG="Stack trace" traceFrameNumber="11" traceFrame="cta-taped(cta::tape::daemon::TapeDaemon::exceptionThrowingMain()+0x1d9) [0x53bac5]"
Oct 14 17:54:05 zitosm-devel01.desy.de cta-taped[7898]: LVL="ERROR" PID="7898" TID="7898" MSG="Stack trace" traceFrameNumber="12" traceFrame="cta-taped(cta::tape::daemon::TapeDaemon::main()+0x37) [0x53b49d]"
Oct 14 17:54:05 zitosm-devel01.desy.de cta-taped[7898]: LVL="ERROR" PID="7898" TID="7898" MSG="Stack trace" traceFrameNumber="13" traceFrame="cta-taped() [0x52c85a]"
Oct 14 17:54:05 zitosm-devel01.desy.de cta-taped[7898]: LVL="ERROR" PID="7898" TID="7898" MSG="Stack trace" traceFrameNumber="14" traceFrame="cta-taped(main+0x339) [0x52cf48]"
Oct 14 17:54:05 zitosm-devel01.desy.de cta-taped[7898]: LVL="ERROR" PID="7898" TID="7898" MSG="Stack trace" traceFrameNumber="15" traceFrame="/lib64/libc.so.6(__libc_start_main+0xf5) [0x7fbb50343555]"
Oct 14 17:54:05 zitosm-devel01.desy.de cta-taped[7898]: LVL="ERROR" PID="7898" TID="7898" MSG="Stack trace" traceFrameNumber="16" traceFrame="cta-taped() [0x52c519]"
The only use of logic_error in cta code is when getting the value of an optional that has not been set previously. What version of cta are you using?
tigran
15 October 2021 07:26
7
This is based on v4.3-1. However, looks like I found the issue. The format of reportUrl
was wrong on my side.
The error you are getting is known. It should be fixed in the next 4.3 release.
tigran
15 October 2021 07:56
9
Great. The other issue, that file checksum is expected in native endiness, in case of x86 in little format.
Thus we see error like:
"Checksum value expected=0x502653e0 actual=0xe0532650"
despite the fact, that file’s checksum is e0532650
tigran
15 October 2021 08:11
10
Is it already fixed in 1044-cta-v4-3-2-release branch? I can re-base my changes on top of it.
Release 4.3-2 is not stable yet. I would suggest rebasing on top of the production_drive_states_in_objectstore
branch
tigran
15 October 2021 12:16
12
Successful flush from dCache to CTA!
devdb13=> select * from archive_file;
archive_file_id | disk_instance_name | disk_file_id | disk_file_uid | disk_file_gid | size_in_bytes | checksum_blob | checksum_adler32 | storage_class_id | creation_time | recon
ciliation_time | is_deleted | collocation_hint
-----------------+--------------------+--------------------------------------+---------------+---------------+---------------+------------------------+------------------+------------------+---------------+------
---------------+------------+------------------
8301 | eosdev | 0000E40B6C2F92934E40BEEE249CE01557B7 | 1 | 1 | 74880409 | \x0a0808011204502653e0 | 3763545680 | 61 | 1634300099 |
1634300099 | 0 |
(1 row)
devdb13=>
tigran
15 October 2021 12:26
13
The recap:
undocumented format of report and error report URLs.
CTA expects adler32 checksum in a reverse order: dCache b1b2b3b4 => CTA b4b3b2b1.
We have tried to write with uid/gid 0.
The current build is rebased on top of 1044-cta-v4-3-2-release
branch.