Commit Graph

  • fb7ce9aeeb Strip parameters for redir urls as well Ai Lin Chia 2017-08-18 12:43:41 +02:00
  • d6628e55fc Strip params from canon url Ai Lin Chia 2017-08-18 12:33:37 +02:00
  • d4df446d75 Fix unit test Ai Lin Chia 2017-08-18 10:45:55 +02:00
  • bef1d2561e Strip parameters for redir urls as well Ai Lin Chia 2017-08-18 12:43:41 +02:00
  • 964d9e4469 Strip params from canon url Ai Lin Chia 2017-08-18 12:33:37 +02:00
  • 4aa6a6db2c Fix unit test Ai Lin Chia 2017-08-18 10:45:55 +02:00
  • 5dc4950b19 Changed URL filters to delete page on first seen HTTP status >= 400 and < 500 Brian Rasmusson 2017-08-17 17:05:03 +02:00
  • c9a79a6b31 added option to verify spiderdb data Brian Rasmusson 2017-08-17 16:33:27 +02:00
  • 9d8e890ddf Changed URL filters to delete page on first seen HTTP status >= 400 and < 500 Brian Rasmusson 2017-08-17 17:05:03 +02:00
  • 1662ad0fb0 added option to verify spiderdb data Brian Rasmusson 2017-08-17 16:33:27 +02:00
  • 058cac28f5 Move uh48 check out of isCorrupt and into dedupSpiderdbList Ai Lin Chia 2017-08-17 13:36:52 +02:00
  • 281d4548e3 Code style changes Ai Lin Chia 2017-08-17 13:32:16 +02:00
  • ad94ed8b62 Use log warn for warning logs Ai Lin Chia 2017-08-17 13:31:55 +02:00
  • 2af8af2d17 Add uh48 check to isCorrupt method Ai Lin Chia 2017-08-17 13:30:54 +02:00
  • 7c6b9731b5 Move uh48 check out of isCorrupt and into dedupSpiderdbList Ai Lin Chia 2017-08-17 13:36:52 +02:00
  • 4c9fea33a2 Code style changes Ai Lin Chia 2017-08-17 13:32:16 +02:00
  • fe99e8450f Use log warn for warning logs Ai Lin Chia 2017-08-17 13:31:55 +02:00
  • d131676032 Add uh48 check to isCorrupt method Ai Lin Chia 2017-08-17 13:30:54 +02:00
  • 933cbb4727 Add sanity check for uh48 Ai Lin Chia 2017-08-17 12:38:11 +02:00
  • 12311e9cf2 Add sanity check for uh48 Ai Lin Chia 2017-08-17 12:38:11 +02:00
  • d16d1fdf4b Add log cache name so we'll know which cache we're logging on Ai Lin Chia 2017-08-17 11:23:03 +02:00
  • d384263f75 Fix compilation error Ai Lin Chia 2017-08-17 11:06:03 +02:00
  • 164f3e9505 Fix compilation error from previous commit Ai Lin Chia 2017-08-17 11:04:10 +02:00
  • b78999ebf6 Add spidered url cache Ai Lin Chia 2017-08-17 11:01:28 +02:00
  • 72433357c4 Add log cache name so we'll know which cache we're logging on Ai Lin Chia 2017-08-17 11:23:03 +02:00
  • 68a5f4b19f Fix compilation error Ai Lin Chia 2017-08-17 11:06:03 +02:00
  • b496fbec79 Add const Ai Lin Chia 2017-08-15 15:40:02 +02:00
  • 9e01584487 Whitespace changes Ai Lin Chia 2017-08-15 15:23:05 +02:00
  • d9375bd968 Fix compilation error from previous commit Ai Lin Chia 2017-08-17 11:04:10 +02:00
  • d04e94a538 Add spidered url cache Ai Lin Chia 2017-08-17 11:01:28 +02:00
  • c9424c37f9 Merge commit 'feba143f493064a050482b9ee4930a1ec49e89ae' into staging Brian Rasmusson 2017-08-17 10:36:38 +02:00
  • b770aa65ef Remove now unused XmlDoc::m_msgc Ai Lin Chia 2017-08-15 11:34:35 +02:00
  • 40fb177adb Only set RdbMerge::m_isMerging to false after we relinquish the lock Ai Lin Chia 2017-08-15 11:06:50 +02:00
  • 5282b2e569 Split EDOCDISALLOWED to more error codes (EDOCDISALLOWEDHTTPSTATUS, EDOCDISALLOWEDROOT) Ai Lin Chia 2017-08-15 10:41:12 +02:00
  • e3f783b134 Remove unused variable Ai Lin Chia 2017-08-14 16:57:36 +02:00
  • 6466cce67d Use isFirstUrlRobotsTxt instead of duplicating logic Ai Lin Chia 2017-08-14 14:46:36 +02:00
  • d436cd90bc Fix XmlDoc::isFirstUrlRobotsTxt check. This should return true only if robots.txt is on root path Ai Lin Chia 2017-08-14 14:42:04 +02:00
  • feba143f49 Document in Url.h which parts of the url becomes which members Ivan Skytte Jørgensen 2017-08-14 16:46:12 +02:00
  • 9e75222383 Fix unit test Ai Lin Chia 2017-08-14 10:18:29 +02:00
  • 6e04e7a14d Log an all-clear message once when all instances are alive Ivan Skytte Jørgensen 2017-08-11 14:50:24 +02:00
  • 1a5e1400bd Reactivated wantedcheck Ivan Skytte Jørgensen 2017-08-11 14:09:38 +02:00
  • 437387610e Temporarily make wantedcheck nonfunctional Ivan Skytte Jørgensen 2017-08-10 16:10:39 +02:00
  • cd6699cc84 Called shlib (wanted-check) in UrlBlockList::isUrlBlocked() Ivan Skytte Jørgensen 2017-08-10 14:13:23 +02:00
  • c67f73cdca Renamed functions in example shlib Ivan Skytte Jørgensen 2017-08-10 13:48:28 +02:00
  • 2bb8892141 Forgotten file Ivan Skytte Jørgensen 2017-08-10 13:38:07 +02:00
  • 1b408b5bcf Added example filter shlib Ivan Skytte Jørgensen 2017-08-08 16:39:59 +02:00
  • 798614b4a7 Added callouts to shlib for domain/url/content analysis Ivan Skytte Jørgensen 2017-08-08 15:37:43 +02:00
  • a8ea049001 Added callouts to shlib for domain/url/content analysis Ivan Skytte Jørgensen 2017-08-08 15:06:08 +02:00
  • 4b2e7d387c admin-page: nuke-doledb button Ivan Skytte Jørgensen 2017-08-07 16:09:36 +02:00
  • 02e6879985 Nuke doledb periodically to hide the bugs/limitations in the spidering stuff Ivan Skytte Jørgensen 2017-08-07 15:26:29 +02:00
  • 06d16b2865 Stop XmlDoc::getMetaList() from coredumping if getNewSpiderReply blocks in tagdb Ivan Skytte Jørgensen 2017-08-07 13:34:35 +02:00
  • 56877ee607 Removed local declaration of nukeDoledb() Ivan Skytte Jørgensen 2017-08-04 16:05:09 +02:00
  • edddd82013 Removed obsolete comments from SpiderColl.cpp Ivan Skytte Jørgensen 2017-08-04 15:14:44 +02:00
  • a8a4b08c7e goto-loop -> for(;;) loop Ivan Skytte Jørgensen 2017-08-04 15:00:43 +02:00
  • 37c27ab332 Switched SpiderColl from using Process::abortXxxx to using gbShutdownXxxx() Ivan Skytte Jørgensen 2017-08-04 13:33:01 +02:00
  • e3845e6be1 More const Ivan Skytte Jørgensen 2017-08-04 13:05:57 +02:00
  • 761d24f026 added human readable spidertime to doledb dump Brian Rasmusson 2017-08-03 22:36:24 +02:00
  • 9e393f9bf5 admin-page: spider: include table over spidercoll->m_nextKeys[] Ivan Skytte Jørgensen 2017-08-03 16:21:14 +02:00
  • c865874e55 Use thread-safe variant of KEYSTR() in SpiderLoop.cpp Ivan Skytte Jørgensen 2017-08-03 16:00:28 +02:00
  • 705283ab80 Use &hellip; instead of ... Ivan Skytte Jørgensen 2017-08-03 14:44:11 +02:00
  • aaf3737811 More const in SpiderLoop.* Ivan Skytte Jørgensen 2017-08-03 14:07:23 +02:00
  • 293b7080da Removed superfluous 'continue' statements at end of loops Ivan Skytte Jørgensen 2017-07-31 16:03:50 +02:00
  • 3283c39882 More debug log in SpiderLoop.cpp Ivan Skytte Jørgensen 2017-07-31 12:49:20 +02:00
  • 68ab3c139e Changed hardcoded doledb fetch size from 50000 to a local constant (150000) Ivan Skytte Jørgensen 2017-07-31 12:39:02 +02:00
  • 6a962fffa3 Dropped 'shortcut' local variable used only in a single line Ivan Skytte Jørgensen 2017-07-31 12:35:28 +02:00
  • 04f8dff2ce More const in SpiderLoop.cpp Ivan Skytte Jørgensen 2017-07-31 12:31:12 +02:00
  • 98cf0ad703 Handle tabs in SafeBuf::brify() Ivan Skytte Jørgensen 2017-07-28 16:12:02 +02:00
  • 8b53071353 Documented the more obscure dump-tagdb commands Ivan Skytte Jørgensen 2017-07-28 16:08:56 +02:00
  • 50c1f5406e When dumping DBs from cmdline use larger 'recSizes' for Msg5 (was: 1MB. Now: 10MB) Ivan Skytte Jørgensen 2017-07-28 15:58:54 +02:00
  • 2722eb1ccc make spiderdb-dump work as intended by using the firstIp cmdline argument correctly Ivan Skytte Jørgensen 2017-07-28 15:43:09 +02:00
  • 67a6ce097e Fix cmdline help to include all parameters for db-dumping Ivan Skytte Jørgensen 2017-07-28 15:35:07 +02:00
  • 6873eda2f7 Made cmd-line parsing a bit cleaner for dump-spider-db Ivan Skytte Jørgensen 2017-07-28 15:20:25 +02:00
  • 4955604eeb word-wrap cmdline help to terminal width instead of hardcoded 60 Ivan Skytte Jørgensen 2017-07-28 15:19:13 +02:00
  • c9dbc15ad3 corrected detection of highest inlink siterank for first doc key Brian Rasmusson 2017-07-28 11:55:14 +02:00
  • 209f3f3cec Don't clean up crashed/unfinished merges when just launching a command-line. Ivan Skytte Jørgensen 2017-07-27 15:34:00 +02:00
  • 49b82f4518 fix comment (old int32_t search+replace) Ivan Skytte Jørgensen 2017-07-27 14:55:37 +02:00
  • 7639aebc58 added .dic to list of extensions not to index Brian Rasmusson 2017-07-27 14:54:04 +02:00
  • ef9d9a1580 Removed unused CollectionRec::m_END_COPY Ivan Skytte Jørgensen 2017-07-27 14:29:47 +02:00
  • 44729bc0a6 More const in Spider.* Ivan Skytte Jørgensen 2017-07-27 14:12:41 +02:00
  • f3db6ee1d6 bugfix use of uninitialized memory when using negative terms Ivan Skytte Jørgensen 2017-07-27 13:41:25 +02:00
  • b90d8c7004 Merge branch 'master' of github.com:privacore/open-source-search-engine Ivan Skytte Jørgensen 2017-07-27 13:20:03 +02:00
  • fca0165092 Add missed file in previous commit Ai Lin Chia 2017-07-26 15:13:31 +02:00
  • 1a944cc25b Add more test Ai Lin Chia 2017-07-26 15:12:15 +02:00
  • f79bb9542d Add wildcard to dnsblocklist + unit test for simple cases Ai Lin Chia 2017-07-26 15:11:11 +02:00
  • 1c1d3c00b9 Extract temp error to isSpiderTempError function Ai Lin Chia 2017-07-26 14:38:38 +02:00
  • 7e9059c954 Add deletion of httpstatus 5xx to urlfilters. Fix httpstatus urlfilter check Ai Lin Chia 2017-07-26 14:37:44 +02:00
  • 2c0c034795 Add unit test for GbCache Ai Lin Chia 2017-07-26 11:40:48 +02:00
  • e4ef33c370 Group blocked documents into deleted stats Ai Lin Chia 2017-07-26 11:40:27 +02:00
  • 6ce9e7f38c Don't expand iframe with 0/1 width & height Ai Lin Chia 2017-07-25 16:29:31 +02:00
  • 391f8f5a4f admin page doledbip table: sort IP-addresses in network-order Ivan Skytte Jørgensen 2017-07-25 16:21:51 +02:00
  • 68a4cddbcc More const in Spider.* Ivan Skytte Jørgensen 2017-07-25 16:09:58 +02:00
  • ca7980b153 Added const variant of SpiderColl::getCollectionRec() Ivan Skytte Jørgensen 2017-07-25 15:37:23 +02:00
  • e45fa92320 We should only add ns response to cache when we already have an IP cache record for it Ai Lin Chia 2017-07-25 15:09:36 +02:00
  • 1dd5372311 Fix (unlikely) race condition in getUrlFilterNum() Ivan Skytte Jørgensen 2017-07-25 15:05:57 +02:00
  • b28ef22088 Removed obsolete hack for collection 'test' Ivan Skytte Jørgensen 2017-07-25 14:34:34 +02:00
  • d37ab5b89c Add caching to GbDns Ai Lin Chia 2017-07-25 12:53:27 +02:00
  • 6e87ce153a Enable round-robin of dns server Ai Lin Chia 2017-07-24 16:37:18 +02:00
  • 24cbf11212 Save waitingtree / spidercache->spidercoll->m_waitingTree together with all the other trees. Ivan Skytte Jørgensen 2017-07-24 16:32:50 +02:00
  • eb7d19e436 Make admin page doledbip nicer in case there is no spider collection (eg. when not spidering) Ivan Skytte Jørgensen 2017-07-24 15:58:30 +02:00
  • 903019c6fc Renamed SpiderColl::m_localTable to m_siteIndexedDocumentCount Ivan Skytte Jørgensen 2017-07-24 15:39:28 +02:00