Commit Graph

  • db74af766b fix core in addExistingColl() mwells 2013-12-10 15:46:38 -0800
  • 82494baa89 move CollectionRec stuff into Collectiondb files for simplicity. mwells 2013-12-10 15:28:04 -0800
  • 14b0682d6b can't use safebuf in a thread. oops! Matt Wells 2013-12-10 14:20:44 -0700
  • 22271c0bb2 do not accept msg4 add requests until in sync with host 0 mwells 2013-12-10 13:20:23 -0800
  • f2d5661965 parmdb overhaul. support collection add/del sync when host comes back online. use udp not tcp. host #0 can now handle a new incoming request while a parm change is currently outstanding. all missed "command" parms will be received when a dead host comes back online, too, like a tight merge for instance. does not use msg4, uses msg3e and msg3f for syncing and sending parms. mwells 2013-12-10 13:09:55 -0800
  • 0e47d48d8c test commit mwells 2013-12-10 13:02:52 -0800
  • 1175478705 got this new parm shit compiling mwells 2013-12-10 12:54:19 -0800
  • 9e1976a8e2 new parm stuff almost compiling. mwells 2013-12-10 11:13:43 -0800
  • 6f6c4aed84 minor admin.html edit. Matt Wells 2013-12-10 10:39:38 -0700
  • 1a7d5e389b very minor admin.html edit Matt Wells 2013-12-10 00:56:56 -0700
  • ec2254d8ed added multi language support note to admin.html Matt Wells 2013-12-09 23:18:33 -0700
  • f7e7acb398 minor log msg updates. updated admin.html to give some performance and storage capacity info. Matt Wells 2013-12-09 23:16:24 -0700
  • 95bd6238d9 do not core when running filters when our gb home dir is really long. thanks bill! call XmlDoc::getSpiderPriority() with a SpiderReply so we can act on m_langId, like chinese, for instance, to filter those langs out from indexing. it was doing this before but got commented out for some reason. mwells 2013-12-09 22:55:02 -0700
  • cc63fd048f Merge branch 'diffbot' of github.com:gigablast/open-source-search-engine into diffbot mwells 2013-12-09 13:46:08 -0800
  • e04d596288 minor comments update. mwells 2013-12-09 13:42:33 -0800
  • 2a5d4beec4 fix core from last push. Matt Wells 2013-12-09 14:21:46 -0700
  • fa497de217 remove annoying log msg Matt Wells 2013-12-09 14:09:48 -0700
  • 44ae7c4de6 mem labelling fixes. fixed bad alloc when generating gigabits. Matt Wells 2013-12-09 14:05:02 -0700
  • 0dcd1211d3 new opensource icon. Matt Wells 2013-12-08 19:47:39 -0700
  • 92ec3f1148 added open source icon to homepage Matt Wells 2013-12-08 19:45:49 -0700
  • 92e3d841a6 minor update Matt Wells 2013-12-08 19:28:45 -0700
  • 12404b4f85 doc updates Matt Wells 2013-12-08 19:26:48 -0700
  • dd3b49faa9 collection name hell Matt Wells 2013-12-08 16:44:37 -0700
  • 3353a90a85 fix resuming a killed merge condition. Matt Wells 2013-12-08 15:50:45 -0700
  • ed79b67d2e core dump fixes Matt Wells 2013-12-08 15:36:23 -0700
  • 144e2c898e save resources by not doing reads on an empty doledb priority. stop saving allSpidersOn and Off parms. Matt Wells 2013-12-08 14:07:31 -0700
  • a2e52a5dc3 little fix Matt Wells 2013-12-08 10:15:54 -0700
  • 020d7741b9 new coll.conf for main with ismedia filter. updated url filters docs some more for "isnew" and explained the errorcount stuff more. Matt Wells 2013-12-08 10:10:51 -0700
  • 65e75167e3 limit posdb merging to 8 files max. added some more url filters documentation. Matt Wells 2013-12-08 09:41:05 -0700
  • 78a4cfe6da forgot to push the .h files Matt Wells 2013-12-07 22:12:48 -0700
  • e1712fc94f fix uninitialized diffbot titlerec header parms. ignore them when not a custom crawl. Matt Wells 2013-12-07 22:11:26 -0700
  • 06edfddf31 a bunch of bug fixes, mostly spider related. also some for pagereindex. Matt Wells 2013-12-07 21:56:37 -0700
  • 5e4b5a112c Merge branch 'master' into diffbot Matt Wells 2013-12-07 11:34:26 -0700
  • 105be1fbdc more core fixes Matt Wells 2013-12-07 10:38:47 -0700
  • 8d92a079c2 minor spider error reply time fix Matt Wells 2013-12-07 10:21:51 -0700
  • e731e5a4d8 Merge branch 'diffbot' of git@github.com:gigablast/open-source-search-engine into diffbot Matt Wells 2013-12-07 10:21:21 -0700
  • 0e846a9389 minor spider reply error fix Matt Wells 2013-12-07 10:21:02 -0700
  • 626a97770c another core fix Matt Wells 2013-12-07 10:14:37 -0700
  • fda7b48500 fix core Matt Wells 2013-12-07 10:11:13 -0700
  • 1bc80ab552 fixed pagereindex. we now add spiderreplies for internal errors like ENOMEM or ENOTFOUND to try to avoid the "CRITICAL CRITICAL" msgs. these are considered temporary errors. Matt Wells 2013-12-07 10:01:17 -0700
  • d9b31d3481 quick bug fix Matt Wells 2013-12-06 22:57:49 -0700
  • 269c10f648 try to figure out why pagereindex never displayed html page when done. Matt Wells 2013-12-06 22:56:06 -0700
  • 522e81913f another parm overhaul checkpoint mwells 2013-12-06 17:33:55 -0800
  • adf9d807ea Merge branch 'diffbot' of github.com:gigablast/open-source-search-engine into diffbot mwells 2013-12-06 12:31:36 -0800
  • 08faf78be9 checkpoint for new parm logic to allow syncing with newly added or deleted collections even if a host was dead when collection was added/deleted. also added parm change request queueing. mwells 2013-12-06 12:29:14 -0800
  • e7bd904765 fix docids only printing. Matt Wells 2013-12-06 09:53:32 -0700
  • c50ef1954f show admin controls on serps if ip is local. fixed up the "reindex" page for deleting/reindexing search results for a given query. Matt Wells 2013-12-06 09:48:30 -0700
  • 4b3e111bed fix spider dumping to remember uh48's between list readings. was showing dups for www.nordicusa.com/webtv at the end. Matt Wells 2013-12-05 10:09:06 -0800
  • 99cc10fccd allow seed urls to match url crawl pattern regardless. Matt Wells 2013-12-03 17:13:38 -0800
  • 432099c4e6 added rebuild=true fix for regex crawl change Matt Wells 2013-12-03 16:23:58 -0800
  • 2e46bcc97f Merge branch 'diffbot' of github.com:gigablast/open-source-search-engine into diffbot Matt Wells 2013-12-03 16:23:20 -0800
  • 03219a3057 add regex support back in Matt Wells 2013-12-03 16:23:05 -0800
  • 6ab9041f45 fix bug when just getting the crawl parms was rebuilding the waiting tree. Matt Wells 2013-12-03 16:17:36 -0800
  • 9f1d79b124 check for null collrec Matt Wells 2013-12-02 10:13:19 -0800
  • cda5968b75 update common word list Matt Wells 2013-12-01 15:19:33 -0700
  • 39f8dc646b default gigabits on for my copy. Matt Wells 2013-12-01 15:07:06 -0700
  • 7f4dca7a07 Merge branch 'master' of git@github.com:gigablast/open-source-search-engine Matt Wells 2013-12-01 14:47:16 -0700
  • 7874c8d832 added ifdef NEEDSLICENSE Matt Wells 2013-12-01 14:47:08 -0700
  • dfe72a76a0 Update LICENSE Gigablast 2013-12-01 13:43:14 -0800
  • d43b55103c show query in msg20 log msg Matt Wells 2013-12-01 12:11:25 -0700
  • 1077191e4a fix log msg bug. Matt Wells 2013-12-01 12:08:05 -0700
  • 08030865e4 fix compiler warning Matt Wells 2013-12-01 11:57:26 -0700
  • d811a13627 fix small oopsy Matt Wells 2013-12-01 11:56:33 -0700
  • 3155869fbf added new log msg for recording cpu time for summary generation. Matt Wells 2013-12-01 11:53:41 -0700
  • 5ee2be8fcf fixed data corruption bug. m_finalCrawlDelay was being stored in xmldoc titlerec header. Matt Wells 2013-11-27 14:18:15 -0800
  • 1129e9b635 Merge branch 'diffbot' of github.com:gigablast/open-source-search-engine into diffbot Matt Wells 2013-11-27 14:09:54 -0800
  • 57eb231a4e do not add timestamps to lastdownload cache if skiphammercheck is true. those are like robots.txt or redirs or root files. Matt Wells 2013-11-26 14:21:17 -0800
  • 0f3374e3f3 measure crawl delay by default from start of each download now. it is a parm in msg13request. Matt Wells 2013-11-26 14:07:28 -0800
  • 4769ca0881 if pthread_create() returns EAGAIN then do not always retry, it makes an infinite loop. Matt Wells 2013-11-26 14:52:07 -0700
  • 8bb086ac60 crawldelay works now but it measures from the end of the download, not the beginning. Matt Wells 2013-11-26 12:58:14 -0800
  • 1c7c9a4d80 Merge branch 'diffbot' of github.com:gigablast/open-source-search-engine into diffbot Matt Wells 2013-11-26 09:19:26 -0800
  • 040bdb8039 fix url filters formulation. fixed extra , in json. fixed upp and ucp patterns if all substrings are negative. Matt Wells 2013-11-26 09:17:38 -0800
  • ca544ddb90 Merge branch 'diffbot' of github.com:gigablast/open-source-search-engine into diffbot Matt Wells 2013-11-25 15:06:11 -0800
  • 1bbbcff755 fix getTokenizedDiffbotReply() to look for type: with a {} depth of 1 so it does not pick up on the type:image in the images array if there is one in the article. Matt Wells 2013-11-25 13:58:31 -0800
  • 61ce4be279 fix major bug when you have twins/mirrors. queries not returning all the results. Matt Wells 2013-11-25 09:53:53 -0700
  • 9a456de178 minor fix Matt Wells 2013-11-24 20:48:47 -0700
  • 5da41cd113 fix a couple different cores. Matt Wells 2013-11-24 19:46:44 -0700
  • 41ce557627 Merge branch 'diffbot' of github.com:gigablast/open-source-search-engine into diffbot Matt Wells 2013-11-22 18:26:53 -0800
  • e8065a0f0a enforce crawl delay perfectly. Matt Wells 2013-11-22 18:26:34 -0800
  • 1826860094 forgot to add diffbot api url parm Matt Wells 2013-11-22 17:55:37 -0800
  • f235a20752 add ! support to all patterns Matt Wells 2013-11-22 17:52:14 -0800
  • c3517ee019 Merge branch 'diffbot' of github.com:gigablast/open-source-search-engine into diffbot Matt Wells 2013-11-22 17:37:42 -0800
  • bc251e17f5 hosts.conf fix Matt Wells 2013-11-22 14:18:03 -0800
  • 791036aabb Merge branch 'diffbot' of github.com:gigablast/open-source-search-engine into diffbot Matt Wells 2013-11-22 14:17:34 -0800
  • 3cc300bf03 spider log debug msg fix. boost max cpu threads to 10, seems to have many cores usually. Matt Wells 2013-11-22 14:17:10 -0800
  • e0a15194e1 fix json double decoding issue. no more partial decodes, json parser stores fully decoded string into separate buf. Matt Wells 2013-11-22 14:16:14 -0800
  • 6b36ddfd31 Merge branch 'diffbot' of github.com:gigablast/open-source-search-engine into diffbot Matt Wells 2013-11-22 11:14:35 -0800
  • 9d9a976b4f fix bug of perpetual round incrementing ad nauseam. Matt Wells 2013-11-22 11:14:03 -0800
  • c8da2a5af7 fix core Matt Wells 2013-11-22 09:47:12 -0700
  • 8a58969ab8 try to fix core. log redirects. Matt Wells 2013-11-22 00:41:33 -0800
  • 79df39655f Merge branch 'diffbot' of github.com:gigablast/open-source-search-engine into diffbot Matt Wells 2013-11-21 12:38:03 -0800
  • 2a5d92a639 log debug update. Matt Wells 2013-11-21 12:37:53 -0800
  • f4de986c7e test to make sure diffbot reply contains "url":" field. try to find out why some diffbot replies are truncated. Matt Wells 2013-11-21 12:37:08 -0800
  • 14e2164acd oopsy Matt Wells 2013-11-20 23:40:30 -0700
  • acac80d4a9 fix core in summary generation highlighting. Matt Wells 2013-11-20 23:38:28 -0700
  • 4a83415832 Merge branch 'diffbot' of github.com:gigablast/open-source-search-engine into diffbot Matt Wells 2013-11-20 16:44:41 -0800
  • dcae4682e8 new api. tossed action/expression and added urlCrawlPattern/urlProcessPattern/apiUrl Matt Wells 2013-11-20 16:41:28 -0800
  • 6f4508c8f1 fix issue of bulk job spidering links because of a simplified redirect. Matt Wells 2013-11-20 16:09:50 -0800
  • 43e40208b8 Merge branch 'master' into diffbot Matt Wells 2013-11-20 15:51:58 -0800
  • d2751211fe do not spider links in XmlDoc::spiderLinks() if it's a custom bulk job. put in logIt() too. Matt Wells 2013-11-20 15:46:17 -0800