Commit Graph

  • 78a4cfe6da forgot to push the .h files Matt Wells 2013-12-07 22:12:48 -07:00
  • e1712fc94f fix uninitialized diffbot titlerec header parms. ignore them when not a custom crawl. Matt Wells 2013-12-07 22:11:26 -07:00
  • 06edfddf31 a bunch of bug fixes, mostly spider related. also some for pagereindex. Matt Wells 2013-12-07 21:56:37 -07:00
  • 5e4b5a112c Merge branch 'master' into diffbot Matt Wells 2013-12-07 11:34:26 -07:00
  • 105be1fbdc more core fixes Matt Wells 2013-12-07 10:38:47 -07:00
  • 8d92a079c2 minor spider error reply time fix Matt Wells 2013-12-07 10:21:51 -07:00
  • e731e5a4d8 Merge branch 'diffbot' of git@github.com:gigablast/open-source-search-engine into diffbot Matt Wells 2013-12-07 10:21:21 -07:00
  • 0e846a9389 minor spider reply error fix Matt Wells 2013-12-07 10:21:02 -07:00
  • 626a97770c another core fix Matt Wells 2013-12-07 10:14:37 -07:00
  • fda7b48500 fix core Matt Wells 2013-12-07 10:11:13 -07:00
  • 1bc80ab552 fixed pagereindex. we now add spiderreplies for internal errors like ENOMEM or ENOTFOUND to try to avoid the "CRITICAL CRITICAL" msgs. these are considered temporary errors. Matt Wells 2013-12-07 10:01:17 -07:00
  • d9b31d3481 quick bug fix Matt Wells 2013-12-06 22:57:49 -07:00
  • 269c10f648 try to figure out why pagereindex never displayed html page when done. Matt Wells 2013-12-06 22:56:06 -07:00
  • 522e81913f another parm overhaul checkpoint mwells 2013-12-06 17:33:55 -08:00
  • adf9d807ea Merge branch 'diffbot' of github.com:gigablast/open-source-search-engine into diffbot mwells 2013-12-06 12:31:36 -08:00
  • 08faf78be9 checkpoint for new parm logic to allow syncing with newly added or deleted collections even if a host was dead when the collection was added/deleted. also added parm change request queueing. mwells 2013-12-06 12:29:14 -08:00
  • e7bd904765 fix docids only printing. Matt Wells 2013-12-06 09:53:32 -07:00
  • c50ef1954f show admin controls on serps if ip is local. fixed up the "reindex" page for deleting/reindexing search results for a given query. Matt Wells 2013-12-06 09:48:30 -07:00
  • 4b3e111bed fix spider dumping to remember uh48's between list readings. was showing dups for www.nordicusa.com/webtv at the end. Matt Wells 2013-12-05 10:09:06 -08:00
  • 99cc10fccd allow seed urls to match url crawl pattern regardless. Matt Wells 2013-12-03 17:13:38 -08:00
  • 432099c4e6 added rebuild=true fix for regex crawl change Matt Wells 2013-12-03 16:23:58 -08:00
  • 2e46bcc97f Merge branch 'diffbot' of github.com:gigablast/open-source-search-engine into diffbot Matt Wells 2013-12-03 16:23:20 -08:00
  • 03219a3057 add regex support back in Matt Wells 2013-12-03 16:23:05 -08:00
  • 6ab9041f45 fix bug where just getting the crawl parms was rebuilding the waiting tree. Matt Wells 2013-12-03 16:17:36 -08:00
  • 9f1d79b124 check for null collrec Matt Wells 2013-12-02 10:13:19 -08:00
  • cda5968b75 update common word list Matt Wells 2013-12-01 15:19:33 -07:00
  • 39f8dc646b default gigabits on for my copy. Matt Wells 2013-12-01 15:07:06 -07:00
  • 7f4dca7a07 Merge branch 'master' of git@github.com:gigablast/open-source-search-engine Matt Wells 2013-12-01 14:47:16 -07:00
  • 7874c8d832 added ifdef NEEDSLICENSE Matt Wells 2013-12-01 14:47:08 -07:00
  • dfe72a76a0 Update LICENSE Gigablast 2013-12-01 13:43:14 -08:00
  • d43b55103c show query in msg20 log msg Matt Wells 2013-12-01 12:11:25 -07:00
  • 1077191e4a fix log msg bug. Matt Wells 2013-12-01 12:08:05 -07:00
  • 08030865e4 fix compiler warning Matt Wells 2013-12-01 11:57:26 -07:00
  • d811a13627 fix small oopsy Matt Wells 2013-12-01 11:56:33 -07:00
  • 3155869fbf added new log msg for recording cpu time for summary generation. Matt Wells 2013-12-01 11:53:41 -07:00
  • 5ee2be8fcf fixed data corruption bug. m_finalCrawlDelay was being stored in xmldoc titlerec header. Matt Wells 2013-11-27 14:18:15 -08:00
  • 1129e9b635 Merge branch 'diffbot' of github.com:gigablast/open-source-search-engine into diffbot Matt Wells 2013-11-27 14:09:54 -08:00
  • 57eb231a4e do not add timestamps to lastdownload cache if skiphammercheck is true. those are like robots.txt or redirs or root files. Matt Wells 2013-11-26 14:21:17 -08:00
  • 0f3374e3f3 measure crawl delay by default from start of each download now. it is a parm in msg13request. Matt Wells 2013-11-26 14:07:28 -08:00
  • 4769ca0881 if pthread_create() returns EAGAIN then do not always retry, it makes an infinite loop. Matt Wells 2013-11-26 14:52:07 -07:00
  • 8bb086ac60 crawldelay works now but it measures from the end of the download, not the beginning. Matt Wells 2013-11-26 12:58:14 -08:00
  • 1c7c9a4d80 Merge branch 'diffbot' of github.com:gigablast/open-source-search-engine into diffbot Matt Wells 2013-11-26 09:19:26 -08:00
  • 040bdb8039 fix url filters formulation. fixed extra , in json. fixed upp and ucp patterns if all substrings are negative. Matt Wells 2013-11-26 09:17:38 -08:00
  • ca544ddb90 Merge branch 'diffbot' of github.com:gigablast/open-source-search-engine into diffbot Matt Wells 2013-11-25 15:06:11 -08:00
  • 1bbbcff755 fix getTokenizedDiffbotReply() to look for type: with a {} depth of 1 so it does not pick up on the type:image in the images array if there is one in the article. Matt Wells 2013-11-25 13:58:31 -08:00
  • 61ce4be279 fix major bug when you have twins/mirrors. queries not returning all the results. Matt Wells 2013-11-25 09:53:53 -07:00
  • 9a456de178 minor fix Matt Wells 2013-11-24 20:48:47 -07:00
  • 5da41cd113 fix a couple different cores. Matt Wells 2013-11-24 19:46:44 -07:00
  • 41ce557627 Merge branch 'diffbot' of github.com:gigablast/open-source-search-engine into diffbot Matt Wells 2013-11-22 18:26:53 -08:00
  • e8065a0f0a enforce crawl delay perfectly. Matt Wells 2013-11-22 18:26:34 -08:00
  • 1826860094 forgot to add diffbot api url parm Matt Wells 2013-11-22 17:55:37 -08:00
  • f235a20752 add ! support to all patterns Matt Wells 2013-11-22 17:52:14 -08:00
  • c3517ee019 Merge branch 'diffbot' of github.com:gigablast/open-source-search-engine into diffbot Matt Wells 2013-11-22 17:37:42 -08:00
  • bc251e17f5 hosts.conf fix Matt Wells 2013-11-22 14:18:03 -08:00
  • 791036aabb Merge branch 'diffbot' of github.com:gigablast/open-source-search-engine into diffbot Matt Wells 2013-11-22 14:17:34 -08:00
  • 3cc300bf03 spider log debug msg fix. boost max cpu threads to 10, seems to have many cores usually. Matt Wells 2013-11-22 14:17:10 -08:00
  • e0a15194e1 fix json double decoding issue. no more partial decodes, json parser stores fully decoded string into separate buf. Matt Wells 2013-11-22 14:16:14 -08:00
  • 6b36ddfd31 Merge branch 'diffbot' of github.com:gigablast/open-source-search-engine into diffbot Matt Wells 2013-11-22 11:14:35 -08:00
  • 9d9a976b4f fix bug of perpetual round incrementing ad nauseam. Matt Wells 2013-11-22 11:14:03 -08:00
  • c8da2a5af7 fix core Matt Wells 2013-11-22 09:47:12 -07:00
  • 8a58969ab8 try to fix core. log redirects. Matt Wells 2013-11-22 00:41:33 -08:00
  • 79df39655f Merge branch 'diffbot' of github.com:gigablast/open-source-search-engine into diffbot Matt Wells 2013-11-21 12:38:03 -08:00
  • 2a5d92a639 log debug update. Matt Wells 2013-11-21 12:37:53 -08:00
  • f4de986c7e test to make sure diffbot reply contains "url":" field. try to find out why some diffbot replies are truncated. Matt Wells 2013-11-21 12:37:08 -08:00
  • 14e2164acd oopsy Matt Wells 2013-11-20 23:40:30 -07:00
  • acac80d4a9 fix core in summary generation highlighting. Matt Wells 2013-11-20 23:38:28 -07:00
  • 4a83415832 Merge branch 'diffbot' of github.com:gigablast/open-source-search-engine into diffbot Matt Wells 2013-11-20 16:44:41 -08:00
  • dcae4682e8 new api. tossed action/expression and added urlCrawlPattern/urlProcessPattern/apiUrl Matt Wells 2013-11-20 16:41:28 -08:00
  • 6f4508c8f1 fix issue of bulk job spidering links because of a simplified redirect. Matt Wells 2013-11-20 16:09:50 -08:00
  • 43e40208b8 Merge branch 'master' into diffbot Matt Wells 2013-11-20 15:51:58 -08:00
  • d2751211fe do not spider links in XmlDoc::spiderLinks() if it's a custom bulk job. put in logIt() too. Matt Wells 2013-11-20 15:46:17 -08:00
  • 9489ce6832 now show json items in csv with aligned columns. use search requests as the way to export data now. Matt Wells 2013-11-20 10:45:10 -08:00
  • cbc1303a2a make performance table taller. we are losing graphical data still. Matt Wells 2013-11-20 10:10:40 -07:00
  • 5baf6a95d4 handle a bunch of oom conditions that caused core. found using oom tester. mwells 2013-11-20 10:14:02 -07:00
  • 46a683a904 label the bigger safebuf chunks of mem so we can see a better breakdown of mem on the stats page, not just a big "SafeBuf" allocation. mwells 2013-11-19 23:53:40 -07:00
  • ff8491c50e set g_errno in getCollRec() Matt Wells 2013-11-19 15:49:32 -08:00
  • b467f70782 fix hosts.conf Matt Wells 2013-11-19 15:41:03 -08:00
  • ec4d77f00a make waiting trees grow dynamically to save space. was taking like 1.5GB of ram for like 100 collections or so. Matt Wells 2013-11-19 15:23:25 -08:00
  • c669f8c138 fix file descriptor leak in Dir class. try to fix core from Thread getting SIGALRM. try to set NOFILES to 1024 at startup in case more are allowed. Matt Wells 2013-11-19 13:41:56 -08:00
  • 35d22bd9aa fix json parser Matt Wells 2013-11-19 09:44:42 -08:00
  • 879cd588e0 use -DPTHREADS not _PTHREADS_ Matt Wells 2013-11-19 00:49:43 -08:00
  • e909b85638 Merge branch 'master' into diffbot Matt Wells 2013-11-19 00:45:49 -08:00
  • 9c62ab362c Revert "use scp not rcp for administrative cmds" Matt Wells 2013-11-19 00:19:21 -07:00
  • 7490748139 errno test update mwells 2013-11-19 00:10:10 -07:00
  • cec1918fb9 minor change to errno tester comment mwells 2013-11-18 23:18:06 -07:00
  • 81a907f073 Merge branch 'master' of github.com:gigablast/open-source-search-engine mwells 2013-11-18 23:14:52 -07:00
  • 339fc9d1de added errno thread tester. mwells 2013-11-18 23:14:36 -07:00
  • 69df8df18e Merge branch 'master' of git@github.com:gigablast/open-source-search-engine Matt Wells 2013-11-18 22:38:20 -07:00
  • aa3a847f17 put some old stuff back until we figure out errno more. Matt Wells 2013-11-18 22:38:12 -07:00
  • 64910ee991 fix oops mwells 2013-11-18 22:32:00 -07:00
  • 4e71bc0698 use pthreads again until we can verify the stability of the new clone approach. Matt Wells 2013-11-18 22:23:38 -07:00
  • a8ffc6e50b indicate diffbot processing errors in the urls csv Matt Wells 2013-11-18 17:38:14 -08:00
  • 25dd764dac Merge branch 'master' into diffbot Matt Wells 2013-11-18 16:59:33 -08:00
  • 7d3b52fb3a an intersect thread taking forever was causing msg5 reads to block forever, and the spider round was getting incremented. fixed a few bugs around that issue. Matt Wells 2013-11-18 16:20:30 -08:00
  • 2e317df2d2 Merge branch 'testing' Matt Wells 2013-11-18 15:53:57 -07:00
  • f85a953a34 fix core dump Matt Wells 2013-11-18 15:53:30 -07:00
  • dbcf4630ff show crawl delay in current urls table Matt Wells 2013-11-18 14:31:01 -08:00
  • 8d9f000f11 make getNumSpidersOutPerIp() specific to a coll so another coll does not prevent a coll from populating its own waiting tree. Matt Wells 2013-11-18 14:13:28 -08:00
  • 3df310d3ec take out -lpthread. don't need it. Matt Wells 2013-11-17 22:25:19 -07:00
  • cc1d117e55 use scp not rcp for administrative cmds like './gb installgb'. most people do not have rcp on their system any more. Matt Wells 2013-11-17 20:49:38 -07:00