TOT5 Workunit Stuck
Author | Message |
---|---|
marsinph Send message Joined: 13 Aug 19 Posts: 13 Credit: 2,205,884 RAC: 0 |
Hello, The tasks are set to run about two hours!? I have four WUs on 4 different hosts (i7-2600K, overclocked to 4.2 GHz). After 15-18 hours they are at 99.2% - 99.989%. The slot directory changes. Stderr.txt shows a checkpoint every 2-3 minutes! But strangely, the task manager shows very little CPU use. Should I cancel? (host id 4149) Best regards <active_task> <project_master_url>https://boinc.tbrada.eu/</project_master_url> <result_name>tot5_51c_St9HdfLp97npeHT7T9omyZcVL_0</result_name> <checkpoint_cpu_time>3.884425</checkpoint_cpu_time> <checkpoint_elapsed_time>56160.656206</checkpoint_elapsed_time> <fraction_done>0.000000</fraction_done> <peak_working_set_size>4894720</peak_working_set_size> <peak_swap_size>2080768</peak_swap_size> <peak_disk_usage>34629</peak_disk_usage> </active_task> |
marsinph Send message Joined: 13 Aug 19 Posts: 13 Credit: 2,205,884 RAC: 0 |
Additional info. On host 4152, WU 749394: running time 63,441 sec, CPU time 12.21 sec! That is .00000001% CPU usage, so I canceled it. Why block a core if the project and the WU hardly use it? On host 4151, WU 749695: already one day and one hour running and CPU time 4 sec! It is at 99.987% and gains about +0.001% every 20 minutes. But there are changes in the slot directory. The checkpoint file also changes, but it is not readable text (only symbols). The only valid one is from host 4150, WU 749580: running 55,780 sec, CPU 11,974 sec (20%). We are very, very far from the prediction of about two hours. All my hosts have a speed of 4 GFLOPS. The project itself estimated the computing size at 40,000 GFLOP. So for me it should take about 10,000 sec to finish, not 90,000! All hosts run Win7 x64, with manually installed JRE, C++ and Visual Studio. Who can explain? Best regards |
Tomáš Brada Project administrator Volunteer developer Send message Joined: 3 Feb 19 Posts: 667 Credit: 432,784 RAC: 0 |
Please zip and upload the checkpoint, input and stderr files of the misbehaving workunits and then abort them. Thank you for reporting. If you have trouble uploading them to the Internet, you can use my email tomasbrod@azet.sk. Additionally, I will look at the database and see if there are any WUs with outrageous times. The information that the CPU was not used very much is very important to me. It could be that the WU thinks it should be suspended. What happens when you restart BOINC and/or your computer? (Do it after the zip/upload.) |
marsinph Send message Joined: 13 Aug 19 Posts: 13 Credit: 2,205,884 RAC: 0 |
"Please zip and upload the checkpoint, input and stderr files of the misbehaved workunits and then abort them." Hello, Thank you. Mail sent to your address. Best regards |
Tomáš Brada Project administrator Volunteer developer Send message Joined: 3 Feb 19 Posts: 667 Credit: 432,784 RAC: 0 |
There have been a few reports of tasks running for much longer than the target two hours. |
Tomáš Brada Project administrator Volunteer developer Send message Joined: 3 Feb 19 Posts: 667 Credit: 432,784 RAC: 0 |
"Mail sent to your address." I have not received any email. Can you please upload it to Google, Yandex, Mediafire, Mega or another service? I will grant you credit for the failed tasks. |
marsinph Send message Joined: 13 Aug 19 Posts: 13 Credit: 2,205,884 RAC: 0 |
"Mail sent to your address." Of course, the mail I sent to the address you provided arrived in my private inbox here on the project!!! Please send me a valid address by private message. Yandex very often blocks mail from the "West". I am sending you my personal mail address by private message, so you will be able to reply securely. Best regards |
Tomáš Brada Project administrator Volunteer developer Send message Joined: 3 Feb 19 Posts: 667 Credit: 432,784 RAC: 0 |
I am sorry that you are experiencing issues. I finally received the mail from ****@skynet.be. I do not have time to look at it immediately. The needed checkpoint files have been obtained, which is great. The files are needed to (hopefully) replicate the issue and, eventually, fix it. "... I think I am not a little cruncher and I think I have some experience. Of course, I will recommend my team (100th world) to not crunch at this time" Of course you are not required to crunch for this project. Nobody is, including myself. If you do not want to, you can leave. You are of course welcome if you choose to return when the issue is solved. The workunits are estimated to run two hours on modern hardware. If they run longer, then something is not right. First try restarting BOINC and/or the computer, and if that does not help, abort the workunits and set the project to suspend. As a reaction to the recent problems, the "tot5" application has been marked as beta so as not to waste computation resources. Thank you. |
marsinph Send message Joined: 13 Aug 19 Posts: 13 Credit: 2,205,884 RAC: 0 |
Hello Tomas Brada, As you say, you finally received my reports, or at least one of the three. Thank you. You could have answered me directly by using my mail. Thank you for not publishing it in full, but some clever people now know it!!! I wrote privately and you made it public!!! And now this mailbox is flooded with spam!!! By the way, you wrote about two hours of running on a "modern" config. Your config: https://boinc.tbrada.eu/hosts_user.php?userid=1 is 25% more powerful (https://www.cpubenchmark.net/high_end_cpus.html). For you two hours, for me more than two days!!! The only explanation is that your WUs are developed on AMD / Linux architecture. By the way, if I look at the statistics (top users, top computers), it seems to be so! I (and my team) will come back when this is clarified! Best regards. You have my private mail, so please use it, and in English. |
marsinph Send message Joined: 13 Aug 19 Posts: 13 Credit: 2,205,884 RAC: 0 |
Already three days and no explanation! No answer. No response. Nothing. Nice!!! Perhaps not the best way to attract crunchers! Look at the number of crunchers and results on ODLK, ODLK1 and here: as good as nothing! Of course, never an answer. Do not forget that the best motivation of big crunchers is based on SETIBZH and the world challenge between teams. The worst is that on ODLK1 the admin says he is on holiday, but does not say for how long! Nice consideration for crunchers!!! |
Tomáš Brada Project administrator Volunteer developer Send message Joined: 3 Feb 19 Posts: 667 Credit: 432,784 RAC: 0 |
I looked into the issue and found no answer. Your result reports that it ran for only 31 seconds, but the log says it was running for many hours. The checkpoint is even written correctly. When I resumed the result from a checkpoint, there was no issue. Even the debugger confirmed the search is progressing. The same workunit (sent to another cruncher, because you aborted them as I told you) completed without issues in 2.5 hours on Windows 10. I submitted the one workunit checkpoint that you sent for validation, and it validated OK. See. |
Tomáš Brada Project administrator Volunteer developer Send message Joined: 3 Feb 19 Posts: 667 Credit: 432,784 RAC: 0 |
If you think you can solve the issue better/faster than me, please take a look at the source code on GitHub. Edit: I do not mean to sound rude. Seriously, I do not know where the issue is. |
Tomáš Brada Project administrator Volunteer developer Send message Joined: 3 Feb 19 Posts: 667 Credit: 432,784 RAC: 0 |
I will add a condition to terminate the task if the deadline is close or if the elapsed gflop is much over the estimate. This will at least give credit for the good part of the task. |
Tomáš Brada Project administrator Volunteer developer Send message Joined: 3 Feb 19 Posts: 667 Credit: 432,784 RAC: 0 |
Added a few days ago, for Linux only. |
Tomáš Brada Project administrator Volunteer developer Send message Joined: 3 Feb 19 Posts: 667 Credit: 432,784 RAC: 0 |
Thanks for testing appversion 39 (tot5/5.08). Your computer 4332 is returning tasks with very short run times. It seems the real-time limit is computed incorrectly. Example: result 1068288 was sent on 14 Sep with a deadline of 21 Sep, but the app calculated that the deadline was on 11 Sep and thus finished immediately. This is a protection against tasks missing the deadline and being wasted. The error of task 1068369 is just an error in the base code of BOINC. |
Tomáš Brada Project administrator Volunteer developer Send message Joined: 3 Feb 19 Posts: 667 Credit: 432,784 RAC: 0 |
n3eo: I think the error was caused by your minimum buffer time set too high. |
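
For readers following the thread, here is a minimal Python sketch of two things discussed above: the run-time estimate marsinph computed (40,000 GFLOP on a 4 GFLOPS host) and the early-termination safeguard Tomáš described (stop if the deadline is close or the elapsed work is far over the estimate). This is not the tot5 source code. The function names, the 6-hour margin, the overrun factor of 4 and the year 2019 in the example dates are all assumptions made for illustration.

```python
from datetime import datetime, timedelta


def estimated_runtime_seconds(work_gflop: float, host_gflops: float) -> float:
    """Estimated run time: total work divided by host speed."""
    return work_gflop / host_gflops


def should_stop_early(
    now: datetime,
    deadline: datetime,
    elapsed_gflop: float,
    estimated_gflop: float,
    safety_margin: timedelta = timedelta(hours=6),  # assumed value
    overrun_factor: float = 4.0,                    # assumed value
) -> bool:
    """Hypothetical guard: stop when the deadline is near or the work done
    is far beyond the estimate, so the finished part can still be reported."""
    deadline_close = (deadline - now) <= safety_margin
    over_budget = elapsed_gflop > overrun_factor * estimated_gflop
    return deadline_close or over_budget


# marsinph's arithmetic: 40,000 GFLOP on a 4 GFLOPS host.
runtime = estimated_runtime_seconds(40_000, 4)

# Dates from the report on result 1068288 (the year is an assumption).
sent = datetime(2019, 9, 14)
real_deadline = datetime(2019, 9, 21)
miscomputed_deadline = datetime(2019, 9, 11)

stop_with_real_deadline = should_stop_early(sent, real_deadline, 0.0, 40_000)
stop_with_wrong_deadline = should_stop_early(sent, miscomputed_deadline, 0.0, 40_000)
```

With the correct deadline the guard does not fire on a fresh task. With a deadline miscomputed to lie before the send date, the "deadline is close" test is already true at start, which matches the tasks that finished immediately.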
©2024 Tomáš Brada