Commit Graph

  • cdcd982cad use more aggressive buffer allocation in SpiderdbRdbSqliteBridge::getList() Ivan Skytte Jørgensen 2017-10-19 12:48:26 +02:00
  • 503b0c8494 Use correct buffer for testing while fetching subset of spiderdb Ivan Skytte Jørgensen 2017-10-19 12:41:56 +02:00
  • dd19a61b94 Merge branch 'sqlite' of github.com:privacore/open-source-search-engine into sqlite Ivan Skytte Jørgensen 2017-10-19 11:54:42 +02:00
  • 9c40ad3ccd Merge branch 'master' into sqlite Ivan Skytte Jørgensen 2017-10-19 11:52:29 +02:00
  • 4359acc493 Ensure we're able to call ./gb installfile from outside instance directory Ai Lin Chia 2017-10-18 16:46:53 +02:00
  • 854ff0c6f2 ./gb install should install c-ares library file as well Ai Lin Chia 2017-10-18 16:46:28 +02:00
  • f299f5934e Make sure we don't need to be in 'current' instance to be able to install gb based on hosts.conf file Ai Lin Chia 2017-10-18 15:56:51 +02:00
  • 387b05928d Updated compilation dependencies for ubuntu & fedora Ai Lin Chia 2017-10-18 13:40:36 +02:00
  • beb58cbe48 Fix unit test Ai Lin Chia 2017-10-18 13:37:04 +02:00
  • 2c22f03d82 Verify that domain is actually domain. Add separate set for simple domain check Ai Lin Chia 2017-10-18 11:08:21 +02:00
  • 4de88c0048 Make sure spiderdb:m_reply is set to zero if it was null and an error occurred Ivan Skytte Jørgensen 2017-10-17 16:44:11 +02:00
  • 0a64bfe99a Document hostsuffix in urlblacklist.txt.example Ai Lin Chia 2017-10-17 15:43:20 +02:00
  • 501651701a Add hostsuffix to urlmatchlist Ai Lin Chia 2017-10-17 15:40:14 +02:00
  • b8bf38a97f sqlite bridge: set spiderrequest.m_siteNumInlinksValid correctly Ivan Skytte Jørgensen 2017-10-17 13:08:16 +02:00
  • a21f4fde99 Fixed spiderdb(sqlite) update statement Ivan Skytte Jørgensen 2017-10-17 12:59:51 +02:00
  • f5d203f831 Switch to read-only mode when dumping spiderdb(sqlite) Ivan Skytte Jørgensen 2017-10-17 12:29:53 +02:00
  • 8e07d88c00 Cleaned up spiderdbSqlite.h Ivan Skytte Jørgensen 2017-10-17 12:22:47 +02:00
  • ad17ed1087 Made code a bit clearer wrt. bitfields Ivan Skytte Jørgensen 2017-10-17 12:13:59 +02:00
  • 3af138e6a8 Log accurate UrlMatchList counts Ai Lin Chia 2017-10-16 17:20:21 +02:00
  • 95789b9b78 Add missing files for unit test Ai Lin Chia 2017-10-16 17:19:45 +02:00
  • d1e4ca343b spiderdb/sqlite: reply::m_hasAuthorityInlink was only used for updating requests. Now not persisted Ivan Skytte Jørgensen 2017-10-16 15:50:02 +02:00
  • e560f72160 spiderdb/sqlite: Use bitfield structs in all places in SpiderdbRdbSqliteBridge.cpp Ivan Skytte Jørgensen 2017-10-16 15:26:06 +02:00
  • cc7d592880 Only create UrlParser object once Ai Lin Chia 2017-10-16 15:17:24 +02:00
  • 8cb75b16e6 Add const version of UrlParser::matchQueryParam Ai Lin Chia 2017-10-16 15:16:51 +02:00
  • 641b77d939 Use range iterator instead of loop. Use emplace_back instead of push_back Ai Lin Chia 2017-10-16 15:16:16 +02:00
  • 3828e73c36 spiderreplies with err=forceddelete should also result in a delete Ivan Skytte Jørgensen 2017-10-16 15:14:54 +02:00
  • ca8860b576 Optimize simple domain & host matching for UrlMatchList Ai Lin Chia 2017-10-16 14:20:47 +02:00
  • 933643f4f0 dump-spiderdb-cvs: fix flag bit interpretation and made output a bit nicer Ivan Skytte Jørgensen 2017-10-16 14:17:47 +02:00
  • 476d6976b2 Merge branch 'master' into sqlite Ivan Skytte Jørgensen 2017-10-16 13:24:03 +02:00
  • ed5f01f4a6 Made bitfield use in spiderdb (sqlite) consistent Ivan Skytte Jørgensen 2017-10-16 13:23:26 +02:00
  • 2cbd147dff Add unit test for UrlMatchList. Cater for 'dir' in filename Ai Lin Chia 2017-10-16 12:47:46 +02:00
  • 428def45a0 Set initial size of html-entities-table better so it doesn't cause a resize every time Ivan Skytte Jørgensen 2017-10-16 12:19:40 +02:00
  • 41948581e1 Add DirTest to test getNextFilename pattern Ai Lin Chia 2017-10-16 12:08:30 +02:00
  • da2bbc86d4 Merge branch 'master' into sqlite Ivan Skytte Jørgensen 2017-10-16 12:01:08 +02:00
  • 7ac7a76856 Allow for multiple urlblacklist Ai Lin Chia 2017-10-13 14:52:33 +02:00
  • a48501f6ef Spidering appears to work now with sqlite instead of rdb. Still needs some cleanup Ivan Skytte Jørgensen 2017-10-13 15:27:16 +02:00
  • 1785a50474 Merge branch 'master' into sqlite Ivan Skytte Jørgensen 2017-10-13 15:10:48 +02:00
  • 2830273647 Removed nowGlobalMS parameter from SpiderColl::addSpiderRequest() Ivan Skytte Jørgensen 2017-10-13 15:08:33 +02:00
  • 57723624a5 Use fnmatch in Dir::getNextFilename for pattern matching instead of checking for wildcard character Ai Lin Chia 2017-10-13 14:32:37 +02:00
  • 2bec1728ea Rename m_hostId to m_myHostId Ai Lin Chia 2017-10-13 13:03:08 +02:00
  • 6667aa0990 Remove commented out code Ai Lin Chia 2017-10-13 13:02:12 +02:00
  • e1be5d3d47 Make parent directory as well Ai Lin Chia 2017-10-13 11:51:36 +02:00
  • 5d98eb1cb6 Merge branch 'master' into sqlite Ivan Skytte Jørgensen 2017-10-12 15:42:46 +02:00
  • 7c2456ac71 SpiderColl.cpp: Don't use plain log(fmt...) but instead log(level,fmt...) Ivan Skytte Jørgensen 2017-10-12 15:28:44 +02:00
  • 5f28d2eddc We need to truncate file before we write to avoid having old dangling data Ai Lin Chia 2017-10-12 11:33:52 +02:00
  • 454cf12dc2 Fix a bug introduced by commit 95023f6563 that causes Msg4 to be decoded incorrectly in query host when receiving spiderdb records Ai Lin Chia 2017-10-11 17:14:11 +02:00
  • 6effc91391 Don't insert into spiderdb when it's record not found; Set index code to ENOTFOUND as well Ai Lin Chia 2017-10-11 14:14:51 +02:00
  • c0e996eccf Don't check blocklist when it's docdelete Ai Lin Chia 2017-10-11 14:14:38 +02:00
  • fc54b8ffcb Add json reply to PageSpiderdbLookup Ai Lin Chia 2017-10-11 11:23:31 +02:00
  • 7de31a3066 Add g_conf.m_logTracePageSpiderdbLookup for enabling/disabling trace logs for PageSpiderdbLookup Ai Lin Chia 2017-10-10 15:36:24 +02:00
  • 646350f62b Add logHexTrace Ai Lin Chia 2017-10-10 15:34:59 +02:00
  • 9785fec43b It compiles, links, starts and can put seeds into the sqlite database Ivan Skytte Jørgensen 2017-10-10 15:54:16 +02:00
  • 50b363b8fe Merge branch 'master' into sqlite Ivan Skytte Jørgensen 2017-10-10 15:18:26 +02:00
  • 31dd1f6b3f Untangled one goto in SpiderColl.cpp Ivan Skytte Jørgensen 2017-10-10 15:16:36 +02:00
  • 086b9cdc3b Add delete spiderdb by url feature Ai Lin Chia 2017-10-10 14:08:12 +02:00
  • 011ab0448d Make local function static Ai Lin Chia 2017-10-10 14:07:52 +02:00
  • e35bb8b3a4 Store index code in TitleRec Ai Lin Chia 2017-10-10 12:47:11 +02:00
  • 0e3a420b0b Initialize url black/white list Ai Lin Chia 2017-10-10 11:57:40 +02:00
  • de0d5ca02d Fix compilation error Ai Lin Chia 2017-10-10 10:45:22 +02:00
  • 82c3410544 A root & valid url is probably not a bad link Ai Lin Chia 2017-10-09 17:27:02 +02:00
  • 830d4fbc96 Merge branch 'sqlite' of github.com:privacore/open-source-search-engine into sqlite Ivan Skytte Jørgensen 2017-10-09 17:21:47 +02:00
  • 4b95846a97 Make sure sqlite m_firstIp is not negative Ivan Skytte Jørgensen 2017-10-09 17:15:22 +02:00
  • 0766c68b6f Added spiderdb->sqlite conversion command Ivan Skytte Jørgensen 2017-10-06 12:13:03 +02:00
  • 977de7ba07 Don't print unwanted urls (if it's already unwanted, it's already removed) Ai Lin Chia 2017-10-09 15:46:13 +02:00
  • 041f174dd1 Don't auto enable spidering when doing a page reindex. Ai Lin Chia 2017-10-09 15:09:35 +02:00
  • e4df7078e1 Fix installfile command. Compare host instead of shard. If we have query & spider host, it will have the same shard. Ai Lin Chia 2017-10-09 14:59:31 +02:00
  • 062295eb26 Only print uniq urls Ai Lin Chia 2017-10-09 14:56:03 +02:00
  • 249cc317df Only dump bad links now Ai Lin Chia 2017-10-09 14:40:40 +02:00
  • d2824988c1 Don't use list offset when calculating ptr Ai Lin Chia 2017-10-09 14:33:56 +02:00
  • 95023f6563 Moved record-size-check and jump-to-next statements to more logical places (will also make sqlite changes easier to merge) Ivan Skytte Jørgensen 2017-10-09 13:00:49 +02:00
  • 52a3669b98 Use a thread name for mergecoordinator-hold-lock thread Ivan Skytte Jørgensen 2017-10-09 12:14:46 +02:00
  • 3f48a6e425 fix crash when viewing Page Info in a setup with no spider hosts. getHostIdWithSpideringEnabled shut down if no spider host was found - now a new function param specifies if a spider host is required or not Brian Rasmusson 2017-10-09 11:25:17 +02:00
  • 8ebedb439f Don't check blocklist when doing rebuild Ai Lin Chia 2017-10-07 14:03:32 +02:00
  • 02ac6c2709 More fixes to make sure we don't hash middomain Ai Lin Chia 2017-10-06 18:26:22 +02:00
  • 84b9fe27cc Remove gbispermalink Ai Lin Chia 2017-10-06 17:46:14 +02:00
  • 32e04fe2ff Fix urlencode for url for character that is bigger than 3 Ai Lin Chia 2017-10-06 17:12:17 +02:00
  • 2199349e7e Minor cleanup in PageCrawlBot Ivan Skytte Jørgensen 2017-10-06 17:09:23 +02:00
  • bc9f783b47 Don't index mid domain for redirect/canonical error documents Ai Lin Chia 2017-10-06 17:04:30 +02:00
  • bc84788c11 Remove gbisadult Ai Lin Chia 2017-10-06 16:42:31 +02:00
  • a1bf803d37 Added helpful comment Ivan Skytte Jørgensen 2017-10-06 15:51:23 +02:00
  • 7d554c5309 Bugfix incomplete cleanup in Collectiondb::cleanTrees() Ivan Skytte Jørgensen 2017-10-06 15:34:34 +02:00
  • 6489c1ea8c Fixed logging Ai Lin Chia 2017-10-06 14:47:30 +02:00
  • 7aba613763 Remove unused parameter Ai Lin Chia 2017-10-06 13:49:28 +02:00
  • 7975714082 Uhm, dependencies didn't catch the compilation error? Ivan Skytte Jørgensen 2017-10-06 14:51:14 +02:00
  • bad4eac887 Moved collnum_t type to separate header file Ivan Skytte Jørgensen 2017-10-06 14:32:45 +02:00
  • fad160e0bc Fix posdb merge where when we try to merge lists containing empty list, we lose data Ai Lin Chia 2017-10-06 13:45:07 +02:00
  • ab16aa9aef Change PRId32 to %d Ai Lin Chia 2017-10-06 13:38:32 +02:00
  • c4ce3d33ee Added spiderdb->sqlite conversion command Ivan Skytte Jørgensen 2017-10-06 12:13:03 +02:00
  • c69273219d Check allowHighFrequencyTermCache flag before using g_htfs Ai Lin Chia 2017-10-05 15:56:18 +02:00
  • cff76bb5eb Fix check of content type str Ai Lin Chia 2017-10-05 15:03:18 +02:00
  • 6ffd8cab05 Fix issue where type:pdf is treated as high frequency term and ignored Ai Lin Chia 2017-10-05 15:02:34 +02:00
  • dd20bfd998 Add DocDelete by url feature Ai Lin Chia 2017-10-03 16:14:15 +02:00
  • a6a3623f04 Copy all required shared lib when doing make dist Ai Lin Chia 2017-10-04 11:16:20 +02:00
  • 16f81cf983 Now we strip \t\r\n before starting any decoding of url Ai Lin Chia 2017-10-03 11:17:06 +02:00
  • 8d4a168249 Code style changes Ai Lin Chia 2017-10-03 11:13:50 +02:00
  • c0cd29c7ae Fix conversion of idn tld with port Ai Lin Chia 2017-10-02 16:48:19 +02:00
  • 07e17d6b24 Suppressed coredumps on early exits Ivan Skytte Jørgensen 2017-10-02 16:12:24 +02:00
  • 71440e595e Removed effectively unused SpiderReply::m_wasIndexed and m_wasIndexedValid Ivan Skytte Jørgensen 2017-10-02 14:08:04 +02:00
  • 0910f3b93b Don't get too big of a chunk using Msg5 Ai Lin Chia 2017-10-02 12:25:32 +02:00
  • 9ea384708a Don't urlencode tab/cr/lf Ai Lin Chia 2017-10-02 12:18:22 +02:00